Re^2: how to quickly parse 50000 html documents?

by brengo (Acolyte)
on Nov 25, 2010 at 22:20 UTC (#873733)

in reply to Re: how to quickly parse 50000 html documents?
in thread how to quickly parse 50000 html documents?

Thanks for all the links! Yes, I have already looked a bit at HTML::TreeBuilder, but since I understand zilch of it right now, I wanted to be sure that it is the right tool. The trees look powerful but seem to carry a huge overhead, which made me ask whether a regex would be faster. The web pages look consistent, so a regex that returns the number after the second occurrence of "drill width:" should do fine.
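A minimal sketch of that regex idea, assuming the number follows the label directly, separated at most by whitespace or a few tags. The helper name second_drill_width and the sample markup are made up for illustration; the real pages may need a different pattern.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number that follows the second
# "drill width:" label in $html, or undef if there are fewer than two.
sub second_drill_width {
    my ($html) = @_;
    # In list context a /g match returns every captured number in page order.
    # (?:<[^>]*>\s*)* skips tags between the label and the number (an assumption).
    my @widths = $html =~ /drill width:\s*(?:<[^>]*>\s*)*([0-9]+(?:\.[0-9]+)?)/gi;
    return $widths[1];
}

# Invented sample markup, only to exercise the helper.
my $sample = '<p>drill width: 3.5 mm</p><p>drill width: <b>4.25</b> mm</p>';
my $width  = second_drill_width($sample);
print defined $width ? "$width\n" : "not found\n";
```

Note that a label not followed by a number is not counted as an occurrence here, which may or may not be what is wanted.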

Using a combination of grep|sed|tr, looped over each of the variables and all the files, my crapcode currently takes about 28 hours, so anything that makes it faster is welcome.
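One way to avoid looping over the files once per variable is to read each file a single time and pull out every value in that pass. The following is only a sketch: the label 'drill depth' and the helper extract_fields are invented for illustration, and it assumes every value of interest is the second number after its label.

```perl
use strict;
use warnings;

# Hypothetical labels: "drill width" is from the post, "drill depth" is invented.
my @labels = ('drill width', 'drill depth');

# Return { label => [numbers in page order] } for one page's HTML.
sub extract_fields {
    my ($html, @wanted) = @_;
    my %found;
    for my $label (@wanted) {
        # \Q...\E quotes any regex metacharacters in the label.
        my @nums = $html =~ /\Q$label\E:\s*([0-9]+(?:\.[0-9]+)?)/gi;
        $found{$label} = \@nums;
    }
    return \%found;
}

# Read each file named on the command line exactly once.
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "skip $file: $!\n"; next };
    my $html = do { local $/; <$fh> };    # slurp the whole file
    close $fh;
    my $fields = extract_fields($html, @labels);
    # One tab-separated line per file: name, then the second value of each label.
    print join("\t", $file, map { $fields->{$_}[1] // '' } @labels), "\n";
}
```

Run as, say, perl extract.pl *.html > widths.tsv (the script name is arbitrary). This starts one perl process in total instead of several grep/sed/tr processes per file and variable, which is likely where much of the time goes.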

Replies are listed 'Best First'.
Re^3: how to quickly parse 50000 html documents?
by chrestomanci (Priest) on Nov 26, 2010 at 10:02 UTC

    I would agree that HTML::TreeBuilder looks daunting, but it is not that hard to use once you are used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

    use HTML::TreeBuilder;

    sub parseResPage {
        my ( $rawHTML ) = @_;
        my $tree = HTML::TreeBuilder->new_from_content( $rawHTML );

        my @tables    = $tree->look_down('_tag', 'table');       # We want the second table
        my @tableRows = $tables[1]->look_down('_tag', 'tr');     # First row is headings, then the data
        my $headRow   = shift @tableRows;
        my @headings;
        my $res_hash;
        my @cells = $headRow->look_down('_tag', 'td');
        push @headings, $_->as_text() foreach (@cells);

        foreach my $mainRow ( @tableRows ) {
            my @cells = $mainRow->look_down('_tag', 'td');
            my $iface = $cells[0]->as_text();
            for ( my $i = 0; $i < scalar @cells; $i++ ) {
                $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text();
            }
        }

        # Explicitly free the memory consumed by the tree.
        $tree->delete();
        return $res_hash;
    }

    Tip: If you are not already familiar with the Perl command-line debugger, now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a breakpoint afterwards, and then run $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.
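A rough sketch of such a load-and-stop script. The sample HTML is invented; in practice one of the real pages would be loaded instead. Setting $DB::single is the standard way to request a stop from inside the code, and it has no effect when the script runs without -d.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample page; replace with the contents of a real file.
my $html = '<html><body><table><tr><td>a</td></tr></table>'
         . '<table><tr><td>drill width:</td><td>4.25</td></tr></table></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Under "perl -d" execution pauses at the next statement after this block.
{
    no warnings 'once';    # silences the "used only once" warning outside the debugger
    $DB::single = 1;
}

# The kind of expressions to try at the debugger prompt, pasted back once they work:
my @tables = $tree->look_down('_tag', 'table');
my @cells  = $tables[1]->look_down('_tag', 'td');
my $line   = join(' | ', map { $_->as_text } @cells);
print "$line\n";

# Free the tree explicitly, as in the snippet above.
$tree->delete;
```

Start it with perl -d script.pl, type c to continue to the breakpoint, then use x to dump the result of an expression, for example x $tree->look_down('_tag', 'table').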

    I suspect that a script using HTML::TreeBuilder will probably end up slower than your simple grep-based script. HTML::TreeBuilder is well-optimised Perl written by some clever people, but it contains lots of code to handle malformed HTML and other corner cases, so it will be slower than a simple script based on regular expressions. Why are you so concerned about speed anyway? How much time have you spent on writing these scripts already?
