PerlMonks  

www::mechanize to scrape randomly ordered data?

by punch_card_don (Curate)
on Sep 19, 2008 at 14:46 UTC ( [id://712532]=perlquestion )

punch_card_don has asked for the wisdom of the Perl Monks concerning the following question:

Malthusian Monks,

Personal project - I'm thinking about next year's football pool. It seems an obvious thing for the Perl programmer in the pool to create a script that scrapes weekly stats from the league website and puts them into a DB for compilation & comparison. No more manual stats collection.

I've never done this personally. Some reading suggests that WWW::Mechanize would be a good module for this task.

BUT - I've noticed that on the web page I'd want to scrape, the stats columns randomly change order on subsequent views. Either the programmer was just lazy and used

    foreach $col (keys %columns) { print... }

producing randomly ordered output, OR the admins are intentionally trying to frustrate scrapers.

Anyway - my questions:

  • Is WWW::Mechanize the best module for this application?
  • Will it, or any other module you suggest, automatically seek columns in data tables by name, regardless of order?

Thanks.




Time flies like an arrow. Fruit flies like a banana.

Replies are listed 'Best First'.
Re: www::mechanize to scrape randomly ordered data?
by whakka (Hermit) on Sep 19, 2008 at 15:10 UTC

    If the columns really do change order at random, you definitely need HTML::TreeBuilder. Straight-up string parsing will result in disaster - HTML::TreeBuilder, well, makes a proper tree structure out of the resulting HTML. Just pass in the content from $mech->content, have it parse, and use the methods in HTML::Element. The one you want for tree traversal is look_down. The generalized way to walk down the nodes is:

    my @table = $root->look_down( _tag => 'table', .... ); # <== insert additional attributes to specify the proper table
    while ( @table ) {
        my $node = shift @table;
        if ( ref $node ) {
            .... # possibly do stuff depending on $node->tag and $node->attr(...)
            unshift @table, $node->content_list;
        }
        else {
            .... # here we have the leaf, or plain text
        }
    }

    Hope this helps.
Re: www::mechanize to scrape randomly ordered data?
by jettero (Monsignor) on Sep 19, 2008 at 14:55 UTC

    The keys of hashes are returned in seemingly-random order. That's normal. You can either keep track of the keys in an array, sort them, or use Tie::IxHash to preserve the order.
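    To illustrate the first two options, here is a minimal sketch using only core Perl. The %columns contents and the @order list are made up for the example; they are not taken from the site in question.

```perl
use strict;
use warnings;

# Hypothetical column data, standing in for the %columns hash in the question.
my %columns = (
    yards      => 312,
    touchdowns => 3,
    fumbles    => 1,
);

# Option 1: sort the keys so every run prints the columns in the same order.
my @sorted = sort keys %columns;
print join( ', ', map { "$_=$columns{$_}" } @sorted ), "\n";

# Option 2: keep an explicit array of keys in the order you want.
my @order = qw(touchdowns yards fumbles);
print join( ', ', map { "$_=$columns{$_}" } @order ), "\n";
```

    Either way the output order no longer depends on Perl's internal hash ordering.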

    Or did you mean the columns on the actual web page randomly change order? That'd be odd to say the least. Oh, yeah, you do. Wow.

    Mech is still the best choice, but you'll have to read the table headers to find the name of each column before you store it in your database, I guess. There are a lot of choices for parsing $mech->content: XML::XPath is my favorite, HTML::TreeBuilder might be helpful, and there's like 12 others.
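    A rough sketch of the read-the-headers idea, using HTML::TreeBuilder. The table markup, the id "stats", and the column names are all invented for illustration; in a real script $html would come from $mech->content, and the real page may lay out its headers differently.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical table; in a real script this would be $mech->content.
my $html = <<'HTML';
<table id="stats">
  <tr><th>Yards</th><th>Team</th><th>TD</th></tr>
  <tr><td>312</td><td>Lions</td><td>3</td></tr>
  <tr><td>287</td><td>Bears</td><td>2</td></tr>
</table>
HTML

my $root  = HTML::TreeBuilder->new_from_content($html);
my $table = $root->look_down( _tag => 'table', id => 'stats' )
    or die "stats table not found\n";

my @rows = $table->look_down( _tag => 'tr' );

# The first row holds the header cells; their order may differ on every fetch.
my @names = map { $_->as_trimmed_text } shift(@rows)->look_down( _tag => 'th' );

my @records;
for my $tr (@rows) {
    my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
    my %record;
    @record{@names} = @cells;    # hash slice: pair each value with its header
    push @records, \%record;
}

$root->delete;    # free the parse tree

# Look values up by column name, never by position.
printf "%s: %s yards, %s TD\n", @{$_}{qw(Team Yards TD)} for @records;
```

    Because each row becomes a hash keyed by header text, the database insert can ask for $record{Team} or $record{Yards} no matter where those columns landed on this particular page load.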

    -Paul
