Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

I would agree that HTML::TreeBuilder looks daunting, but it is not that hard to use once you are used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

sub parseResPage { my ( $rawHTML ) = @_; my $tree = HTML::TreeBuilder->new_from_content( $rawHTML ); my @tables = $tree->look_down('_tag', 'table'); # We wa +nt the second table my @tableRows = $tables[1]->look_down('_tag', 'tr'); # First ro +w is headings, then the data my $headRow = shift @tableRows; my @headings; my $res_hash; my @cells = $headRow->look_down('_tag', 'td'); push @headings, $_->as_text() foreach (@cells); foreach my $mainRow ( @tableRows ) { my @cells = $mainRow->look_down('_tag', 'td'); my $iface = $cells[0]->as_text(); for( my $i=0; $i<scalar@cells; $i++ ) { $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text +(); } } # Explicity free the memory consumed by the tree. $tree->delete(); return $res_hash; }

Tip: If you are not already familiar with the perl command line debugger then now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a break point afterwards, and then start running $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.

I suspect that if you write a script that uses HTML::TreeBuilder then it will probably end up being slower than your simple grep based script. HTML::TreeBuilder is well optimised perl written by some clever people, but it contains lots code to handle malformed HTML, and other corner cases, so it will be slower than a simple regular expression based script. Why are you so concerned about speed anyway? How much time have you spent on writing these scripts already?


In reply to Re^3: how to quickly parse 50000 html documents? by chrestomanci
in thread how to quickly parse 50000 html documents? by brengo

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others having an uproarious good time at the Monastery: (6)
    As of 2015-07-05 13:36 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









      Results (67 votes), past polls