Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked

Re^3: how to quickly parse 50000 html documents?

by chrestomanci (Priest)
on Nov 26, 2010 at 10:02 UTC ( #873814=note: print w/replies, xml ) Need Help??

in reply to Re^2: how to quickly parse 50000 html documents?
in thread how to quickly parse 50000 html documents?

I would agree that HTML::TreeBuilder looks daunting, but it is not that hard to use once you are used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

sub parseResPage { my ( $rawHTML ) = @_; my $tree = HTML::TreeBuilder->new_from_content( $rawHTML ); my @tables = $tree->look_down('_tag', 'table'); # We wa +nt the second table my @tableRows = $tables[1]->look_down('_tag', 'tr'); # First ro +w is headings, then the data my $headRow = shift @tableRows; my @headings; my $res_hash; my @cells = $headRow->look_down('_tag', 'td'); push @headings, $_->as_text() foreach (@cells); foreach my $mainRow ( @tableRows ) { my @cells = $mainRow->look_down('_tag', 'td'); my $iface = $cells[0]->as_text(); for( my $i=0; $i<scalar@cells; $i++ ) { $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text +(); } } # Explicity free the memory consumed by the tree. $tree->delete(); return $res_hash; }

Tip: If you are not already familiar with the perl command line debugger then now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a break point afterwards, and then start running $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.

I suspect that if you write a script that uses HTML::TreeBuilder then it will probably end up being slower than your simple grep based script. HTML::TreeBuilder is well optimised perl written by some clever people, but it contains lots code to handle malformed HTML, and other corner cases, so it will be slower than a simple regular expression based script. Why are you so concerned about speed anyway? How much time have you spent on writing these scripts already?

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://873814]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2018-06-24 19:15 GMT
Find Nodes?
    Voting Booth?
    Should cpanminus be part of the standard Perl release?

    Results (126 votes). Check out past polls.