Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Re^2: how to quickly parse 50000 html documents?

by brengo (Acolyte)
on Nov 25, 2010 at 22:20 UTC ( #873733=note: print w/ replies, xml ) Need Help??


in reply to Re: how to quickly parse 50000 html documents?
in thread how to quickly parse 50000 html documents?

Thanks for all the links! Yes, I already looked a bit at HTML::TreeBuilder but as I understand zilch of it now I wanted to be sure that this is the right tool. The fact that the trees look powerful but with a huge overhead made me ask whether a regex would be faster. The web pages look consistent and a regex to give the number after the second occurence of "drill width:" should do fine.

Using a combination of grep|sed|tr|, each looped over each of the variables and all the files, my crapcode takes about 28hrs right now so everything that makes it faster is welcome.


Comment on Re^2: how to quickly parse 50000 html documents?
Replies are listed 'Best First'.
Re^3: how to quickly parse 50000 html documents?
by chrestomanci (Priest) on Nov 26, 2010 at 10:02 UTC

    I would agree that HTML::TreeBuilder looks daunting, but it is not that hard to use once you are used to it. Here is a snippet from a script I wrote recently that uses HTML::TreeBuilder to pull some data out of a table. (Feel free to copy it if you like.)

    sub parseResPage { my ( $rawHTML ) = @_; my $tree = HTML::TreeBuilder->new_from_content( $rawHTML ); my @tables = $tree->look_down('_tag', 'table'); # We wa +nt the second table my @tableRows = $tables[1]->look_down('_tag', 'tr'); # First ro +w is headings, then the data my $headRow = shift @tableRows; my @headings; my $res_hash; my @cells = $headRow->look_down('_tag', 'td'); push @headings, $_->as_text() foreach (@cells); foreach my $mainRow ( @tableRows ) { my @cells = $mainRow->look_down('_tag', 'td'); my $iface = $cells[0]->as_text(); for( my $i=0; $i<scalar@cells; $i++ ) { $res_hash->{$iface}{ $headings[$i] } = $cells[$i]->as_text +(); } } # Explicity free the memory consumed by the tree. $tree->delete(); return $res_hash; }

    Tip: If you are not already familiar with the perl command line debugger then now is the time to learn. When I am working with HTML::TreeBuilder code, my usual approach is to write a script that just loads the tree and sets a break point afterwards, and then start running $tree->look_down() commands interactively until I find a combination that gives me what I am looking for. I then paste that back into my editor and use it in my script.

    I suspect that if you write a script that uses HTML::TreeBuilder then it will probably end up being slower than your simple grep based script. HTML::TreeBuilder is well optimised perl written by some clever people, but it contains lots code to handle malformed HTML, and other corner cases, so it will be slower than a simple regular expression based script. Why are you so concerned about speed anyway? How much time have you spent on writing these scripts already?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://873733]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others studying the Monastery: (16)
As of 2015-07-31 14:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (278 votes), past polls