PerlMonks  

Why doesn't my scraper work?

by jdlev (Scribe)
on Nov 19, 2013 at 21:59 UTC (id://1063399)

jdlev has asked for the wisdom of the Perl Monks concerning the following question:

I'm just trying to get the program to find a table and print something if it does. Even though there are tons of tables on the page, it still returns nothing when I run it. Any tips on where I might have screwed up my code?
    my $html_file = get("http://www.cbssports.com/nfl/injuries/pup");
    die "Couldn't Get HTML File!" unless defined $html_file;
    #print $html_file;
    for($depth = 0; $depth < 100; $depth++) {
        for($count = 0; $count < 100; $count++) {
            my $te = HTML::TableExtract->new( depth => $depth, count => $count )
                or die(print "Unable To Extract Table");
            $te->parse($html_file) or die(print "Unable to parse string");
            foreach $ts ($te->tables) {
                print "Table found at ";
                foreach $row ($ts->rows) {
                    print @$row;
                }
            }
            #print "Depth = " . $depth . " Count = " . $count . "\n";
        }
    }
    #print "Injured Players Have Been Deleted From Database \n \n";
I love it when a program comes together - jdhannibal

Replies are listed 'Best First'.
Re: Why doesn't my scraper work?
by Old_Gray_Bear (Bishop) on Nov 20, 2013 at 00:01 UTC
    Take a look at the CBS Sports API. It is better to use the authorized tools to get your data than to try to subvert the TOS and scrape the site.

    Nota Bene: CBS provides a lot of Developer tools to develop your own Apps for the Fantasy Leagues. You might want to start with the "Create Applications" tab and go from there.

    Update -- I did a little wandering through the CBS Sports site and found the Terms of Service document. The second and fifth bullet points address web scraping. It boils down to "Don't Do It".

    ----
    I Go Back to Sleep, Now.

    OGB

Re: Why doesn't my scraper work?
by talexb (Chancellor) on Nov 19, 2013 at 22:09 UTC

    My best guess is that there's some JavaScript involved, which makes scraping a lot more complicated.

    You should also keep in mind that scraping a site like http://www.cbssports.com might be against their Terms Of Use. If there's an API that you can use instead, all the better.

    Alex / talexb / Toronto

    Thanks PJ. We owe you so much. Groklaw -- RIP -- 2003 to 2013.
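
    One quick way to test the JavaScript guess is to count the <table> tags in the raw HTML exactly as it was fetched, before any parser sees it. The sketch below is only an illustration: count_table_tags is a hypothetical helper, and $sample_html is a made-up stand-in for whatever get() returned. If the count is zero on the real page, the tables are probably being built in the browser and no HTML parser will find them in the downloaded text.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): count opening <table> tags
# in a string of HTML, case-insensitively.
sub count_table_tags {
    my ($html) = @_;
    # The empty-list assignment forces list context, so we get a match count.
    my $count = () = $html =~ /<table\b/gi;
    return $count;
}

# Made-up stand-in for the string returned by get(); a real check would
# pass the fetched page instead.
my $sample_html = <<'HTML';
<html><body>
<table class="data"><tr><td>Player</td></tr></table>
<div id="filled-in-by-script"></div>
</body></html>
HTML

printf "Found %d <table> tag(s) in the fetched HTML\n",
    count_table_tags($sample_html);
```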

      Is there a way for HTML::TableExtract to look up a table by an attribute such as class? I've tried that before and it seems it doesn't like looking for a class name.
      I love it when a program comes together - jdhannibal
        it seems it doesn't like looking for a class name
        In what way does it not like it? As long as you initialise the module with the attributes you want it should not have a problem:
            my $te = HTML::TableExtract->new( attribs=> { class=>'class-name' } );
            $te->parse($html_string);
            for my $ts ($te->tables) {
                print "Table with class 'class-name' found\n";
            }
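
        To try the attribs approach without touching any website, here is a minimal self-contained sketch that runs it against an inline page. The HTML, the class names 'nav' and 'data', and the player row are all made up for illustration; only the HTML::TableExtract calls mirror the snippet above.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample page: two tables, only one with the class we want.
my $html_string = <<'HTML';
<html><body>
<table class="nav"><tr><td>Home</td><td>Scores</td></tr></table>
<table class="data">
  <tr><td>Player</td><td>Status</td></tr>
  <tr><td>A. Example</td><td>PUP</td></tr>
</table>
</body></html>
HTML

# Ask only for tables whose class attribute is 'data'.
my $te = HTML::TableExtract->new( attribs => { class => 'data' } );
$te->parse($html_string);
$te->eof;    # tell the parser there is no more input

# Collect each row of each matching table as one joined line.
my @found;
for my $ts ( $te->tables ) {
    for my $row ( $ts->rows ) {
        push @found, join ' | ', map { defined $_ ? $_ : '' } @$row;
    }
}
print "$_\n" for @found;
```

        If this prints the rows of the 'data' table but the same constructor finds nothing on a real page, the likely suspects are the fetched HTML itself or the exact value of the class attribute, not the module.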
        For others wondering, I figured it out by using WWW::Mechanize instead of LWP::Simple to fetch the original data. With WWW::Mechanize it's at least saving the full code from the page now.
        I love it when a program comes together - jdhannibal
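
        The post above doesn't show the code it ended up with, so the following is only a guess at what the swap might look like. fetch_html is a hypothetical helper name, and the URL comes from the command line so that nothing is fetched unless you ask for it. Why WWW::Mechanize returned the full page when LWP::Simple did not is not explained in the thread.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper (not from the thread): fetch a page and return
# its decoded HTML as a string.
sub fetch_html {
    my ($url) = @_;
    # autocheck => 1 makes get() die on HTTP errors, so there is no
    # need for a separate "unless defined" check afterwards.
    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get($url);
    return $mech->content;
}

# Only hit the network when a URL is given on the command line.
if (@ARGV) {
    my $html_file = fetch_html( $ARGV[0] );
    printf "Fetched %d characters\n", length $html_file;
}
```

        The returned string can be handed to $te->parse() exactly as $html_file was in the original question. As noted earlier in the thread, point it only at pages whose terms of use allow automated fetching.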
