There was a discussion thread about this a couple of weeks ago.
- Don't use bare regular expressions unless the html pages are very simple and very consistent, as you will drive yourself mad trying to catch all the corner cases, and you will end up writing a buggy HTML parser.
- Instead go to CPAN and download an HTML parser that someone else has already written and debugged. HTML::TreeBuilder and HTML::TokeParser::Simple both come Highly recommended
- You should use a GUI HTML tree inspector such as Firebug, or the inspect element tool in google chrome, to tell you where the elements you are looking for are in the HTML structure.
PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also you need to explicitly delete html trees once you are done with them, as the data structures generated by HTML::TreeBuilder will not be freed automatically when they go out of scope, and each will consume several megabytes of RAM.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||