Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

There was a discussion thread about this a couple of weeks ago.

Short answer:

  • Don't use bare regular expressions unless the html pages are very simple and very consistent, as you will drive yourself mad trying to catch all the corner cases, and you will end up writing a buggy HTML parser.
  • Instead go to CPAN and download an HTML parser that someone else has already written and debugged. HTML::TreeBuilder and HTML::TokeParser::Simple both come Highly recommended
  • You should use a GUI HTML tree inspector such as Firebug, or the inspect element tool in google chrome, to tell you where the elements you are looking for are in the HTML structure.

PS: It won't be that quick. In my experience, HTML::TreeBuilder takes a significant fraction of a second to load an average HTML document, so multiply that by 50_000 files and you are looking at a day or so of processing time. Also you need to explicitly delete html trees once you are done with them, as the data structures generated by HTML::TreeBuilder will not be freed automatically when they go out of scope, and each will consume several megabytes of RAM.


In reply to Re: how to quickly parse 50000 html documents? by chrestomanci
in thread how to quickly parse 50000 html documents? by brengo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others romping around the Monastery: (6)
    As of 2014-09-15 03:18 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite cookbook is:










      Results (145 votes), past polls