in reply to how to quickly parse 50000 html documents?
There was a discussion thread about this a couple of weeks ago.
- Don't use bare regular expressions unless the html pages are very simple and very consistent, as you will drive yourself mad trying to catch all the corner cases, and you will end up writing a buggy HTML parser.
- Instead go to CPAN and download an HTML parser that someone else has already written and debugged. HTML::TreeBuilder and HTML::TokeParser::Simple both come Highly recommended
- You should use a GUI HTML tree inspector such as Firebug, or the inspect element tool in google chrome, to tell you where the elements you are looking for are in the HTML structure.
PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also you need to explicitly delete html trees once you are done with them, as the data structures generated by HTML::TreeBuilder will not be freed automatically when they go out of scope, and each will consume several megabytes of RAM.