PerlMonks  

Why doesn't my scraper work?

by jdlev (Scribe)
on Nov 19, 2013 at 21:59 UTC

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

I'm just trying to get the program to find a table and print something when it does. Even though there are tons of tables on the page, it returns nothing when run. Any tips on where I might have screwed up my code?
    use LWP::Simple;
    use HTML::TableExtract;

    my $html_file = get("http://www.cbssports.com/nfl/injuries/pup");
    die "Couldn't Get HTML File!" unless defined $html_file;
    #print $html_file;

    # Try every depth/count combination and report any table that matches.
    for (my $depth = 0; $depth < 100; $depth++) {
        for (my $count = 0; $count < 100; $count++) {
            my $te = HTML::TableExtract->new( depth => $depth, count => $count )
                or die "Unable To Extract Table";
            $te->parse($html_file) or die "Unable to parse string";
            foreach my $ts ($te->tables) {
                print "Table found at ";
                foreach my $row ($ts->rows) {
                    print @$row;
                }
            }
            #print "Depth = " . $depth . " Count = " . $count . "\n";
        }
    }
    #print "Injured Players Have Been Deleted From Database \n \n";
I love it when a program comes together - jdhannibal

Replies are listed 'Best First'.
Re: Why doesn't my scraper work?
by Old_Gray_Bear (Bishop) on Nov 20, 2013 at 00:01 UTC
    Take a look at the CBS Sports API. It is better to use the authorized tools to get your data than to subvert the TOS by scraping the site.

    Nota Bene: CBS provides a lot of Developer tools to develop your own Apps for the Fantasy Leagues. You might want to start with the "Create Applications" tab and go from there.

    Update -- I did a little wandering through the CBS Sports site and found the Terms of Service document. The second and fifth bullet points address web scraping. It boils down to "Don't Do It".

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Why doesn't my scraper work?
by talexb (Chancellor) on Nov 19, 2013 at 22:09 UTC

    My best guess is that there's some JavaScript involved, which makes things a lot more complicated when scraping is involved.
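One quick way to test this guess, before touching the parser, is to count the `<table` tags in whatever the fetch actually returned. A minimal sketch: `count_table_tags` is a hypothetical helper name, not part of any module, and the regex is a rough sanity check rather than a real HTML parse.

```perl
use strict;
use warnings;

# Hypothetical helper: roughly count opening <table> tags in a string.
# A regex is not an HTML parser; this is only a sanity check.
sub count_table_tags {
    my ($html) = @_;
    return 0 unless defined $html;
    my $count = () = $html =~ /<table\b/gi;
    return $count;
}

# Made-up sample standing in for a fetched page.
my $sample = '<html><body><table><tr><td>x</td></tr></table></body></html>';
print "Found ", count_table_tags($sample), " table tag(s)\n";
```

If the count is zero for the real page, the tables are probably being added later by script, or the server sent a different page than the browser sees, and no depth/count setting will find them.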

    You should also keep in mind that scraping a site like http://www.cbssports.com might be against their Terms Of Use. If there's an API that you can use instead, all the better.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.

      Is there a way for HTML::TableExtract to look up a table by an attribute such as "class = etc"? I've tried that before and it doesn't seem to like looking for a class name.
      I love it when a program comes together - jdhannibal
        it seems it doesn't like looking for a class name
        In what way does it not like it? As long as you initialise the module with the attributes you want it should not have a problem:
        my $te = HTML::TableExtract->new( attribs => { class => 'class-name' } );
        $te->parse($html_string);
        for my $ts ($te->tables) {
            print "Table with class 'class-name' found\n";
        }
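To see the attribs constraint work without fetching anything, it can be tried on an inline string. A minimal sketch, assuming HTML::TableExtract is installed: the sample HTML, the class names, and the helper `rows_for_class` are all made up for illustration.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: return the rows of every table whose class
# attribute matches $class, as a list of array refs of cell text.
sub rows_for_class {
    my ($html, $class) = @_;
    my $te = HTML::TableExtract->new( attribs => { class => $class } );
    $te->parse($html);
    $te->eof;    # signal end of document to the underlying HTML::Parser
    return map { $_->rows } $te->tables;
}

# Made-up sample: two tables, only one with class "data".
my $html = <<'HTML';
<table class="nav"><tr><td>Home</td></tr></table>
<table class="data">
  <tr><td>Player</td><td>Status</td></tr>
  <tr><td>Smith</td><td>PUP</td></tr>
</table>
HTML

for my $row (rows_for_class($html, 'data')) {
    print join(' | ', map { defined $_ ? $_ : '' } @$row), "\n";
}
```

If this works on a sample but not on the real page, the fetched HTML is the more likely culprit than the class lookup.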
        For others wondering: I figured it out by using WWW::Mechanize instead of LWP::Simple to fetch the original data. With WWW::Mechanize it's at least saving the full code from the page now.
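A minimal sketch of that change, assuming WWW::Mechanize is installed. `fetch_html` is a hypothetical helper name, and the thread does not establish why this fetch returns the full page where LWP::Simple's did not.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper: fetch a URL with WWW::Mechanize and return the body.
sub fetch_html {
    my ($url) = @_;
    my $mech = WWW::Mechanize->new( autocheck => 1 );  # die on HTTP errors
    $mech->get($url);
    return $mech->content;
}

# Only touch the network when a URL is given on the command line.
if ( my $url = shift @ARGV ) {
    my $html = fetch_html($url);
    print "Fetched ", length($html), " characters\n";
}
```

The returned string can be passed to `$te->parse(...)` exactly as `$html_file` was in the original script.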
        I love it when a program comes together - jdhannibal
