PerlMonks  

Help With Online Table Scraper

by jdlev (Scribe)
on Mar 02, 2011 at 04:40 UTC (#890901, perlquestion)
jdlev has asked for the wisdom of the Perl Monks concerning the following question:

It's been a while since I programmed in perl, so please try to be fairly elementary in your explanations :)

I've been trying to extract data from the Key Statistics page of Yahoo Finance. Basically, I'm looking to create a custom stock screener. To do that, I've got to pull information from finance.yahoo.com. The Key Statistics page has a lot of info I'm interested in that doesn't appear in normal screeners.

I've tried both HTML::TableExtract & HTML::TableExtractor to no avail. Looking at their setup is like trying to read French.

Here's my code so far

use LWP::Simple;
use HTML::TableExtract;

my $p = new HTML::TableExtract( depth => 0, count => 0, gridmap => 0 );
my $money = get("http://finance.yahoo.com/q/ks?s=MNDO+Key+Statistics");

if (!$money) {
    print "Sorry, no data returned";
}
else {
    print "I found data!";
}
print $p;

Thanks for helping to get me on the right track!

I love it when a program comes together - jdhannibal

Re: Help With Online Table Scraper
by GrandFather (Cardinal) on Mar 02, 2011 at 04:59 UTC

    You've got the basic bits, but you've put them together the wrong way and there's a lot of stuff missing. Here's a starting point:

    use strict;
    use warnings;
    use LWP::Simple;
    use HTML::TableExtract;

    my $money = get("http://finance.yahoo.com/q/ks?s=MNDO+Key+Statistics");
    my $p = new HTML::TableExtract();
    $p->parse($money);
    my $table = $p->table(2, 0);

    for my $row ($table->rows ()) {
        ! defined and $_ = ' ' for @$row;
        print "@$row\n";
    }

    Prints:

    Market Cap (intraday)5: 57.14M
    Enterprise Value (Mar 2, 2011)3: 39.06M
    Trailing P/E (ttm, intraday): 11.88
    Forward P/E (fye Dec 31, 2012)1: N/A
    PEG Ratio (5 yr expected)1: N/A
    Price/Sales (ttm): 2.91
    Price/Book (mrq): 2.41
    Enterprise Value/Revenue (ttm)3: 1.96
    Enterprise Value/EBITDA (ttm)3: 5.95
    True laziness is hard work
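
The `table(2, 0)` call in the reply above selects one table by its (depth, count) coordinates. A minimal sketch of how one might discover those coordinates for a page, assuming HTML::TableExtract is installed. The inline HTML and the helper name `table_coords` are made up for illustration and are not Yahoo's actual markup:

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical stand-in for a downloaded page: two sibling tables at depth 0.
my $html = '<table><tr><td>Menu</td></tr></table>'
         . '<table><tr><td>Market Cap</td><td>57.14M</td></tr>'
         . '<tr><td>Price/Sales</td><td>2.91</td></tr></table>';

# Return the "depth,count" coordinates of every table found in $page.
sub table_coords {
    my ($page) = @_;
    my $te = HTML::TableExtract->new();    # no constraints: match every table
    $te->parse($page);
    my @coords = sort map { join ',', $_->coords } $te->tables;
    return @coords;
}

print "Found table at $_\n" for table_coords($html);

# Once the coordinates are known, pull out just that table.
my $te = HTML::TableExtract->new();
$te->parse($html);
my $table = $te->table(0, 1);              # depth 0, second table
print join(' | ', @$_), "\n" for $table->rows;
```

Running something like this against the real page and eyeballing the rows of each table is one way to find the pair of numbers to pass to `table()`.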
      ! defined and $_ = ' ' for @$row;
      Maybe you mean ! defined and  $_ != ' ' for @$row;   ?

        No. That wouldn't make any sense at all!

        Perhaps I should have gone with my second instinct and written that line as:

        for my $cell (@$row) {
            if (! defined $cell) {
                $cell = ' ';
            }
        }

        I didn't because the fairly trivial task of setting undefined cells to a space becomes the dominant code in the sample. I'd hoped that the line would be pretty much ignored, but in retrospect that was pretty silly really.

        True laziness is hard work
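
For anyone else puzzling over that one-liner: inside a `for` statement modifier, `$_` is an alias to each element, so assigning to it changes the array in place. A small sketch with a made-up row (the helper name `blank_undef` is hypothetical), using the defined-or assignment `//=`, which assumes Perl 5.10 or later:

```perl
use strict;
use warnings;

# Replace undefined cells with a single space, modifying the row in place.
sub blank_undef {
    my ($row) = @_;
    $_ //= ' ' for @$row;    # same effect as: ! defined and $_ = ' ' for @$row;
    return $row;
}

my $row = [ 'Price/Book (mrq):', undef, '2.41' ];    # made-up sample row
blank_undef($row);
print join('|', @$row), "\n";    # prints: Price/Book (mrq):| |2.41
```

The suggested `$_ != ' '` would only perform a numeric comparison and throw the result away, so the undefined cells would stay undefined.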
Re: Help With Online Table Scraper
by Anonymous Monk on Mar 02, 2011 at 05:19 UTC
      Sinistral is right: use a documented API whenever one is available. Scraping is a fragile PITA :) Come to think of it, Web::Scraper might also be a bit of a PITA, but I've only studied the trivial examples, not the others.
Re: Help With Online Table Scraper
by Sinistral (Prior) on Mar 02, 2011 at 14:28 UTC

    Before anyone starts talking about doing HTML scraping these days, it's always important to ask the question, "Is there a better way to do this?". The answer, direct from Yahoo, is yes, yes there is. Use web APIs, not HTML scraping (and YQL is your friend). This way, if Yahoo changes the syntax of their HTML (which can happen at a moment's notice), your tool will continue working.
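
To illustrate why a structured API response tends to be less fragile than scraped HTML, here is a sketch that decodes a JSON payload with the core JSON::PP module. The payload shape, the field names (`symbol`, `market_cap`, `trailing_pe`) and the helper `key_stats` are hypothetical and do not describe any actual Yahoo response; the values are borrowed from the output earlier in the thread:

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical API response; a real service defines its own field names.
my $response = '{"symbol":"MNDO","market_cap":"57.14M","trailing_pe":11.88}';

# Pick the named fields out of a decoded response, ignoring everything else.
sub key_stats {
    my ($json_text) = @_;
    my $data = decode_json($json_text);
    return map { ($_ => $data->{$_}) } qw(symbol market_cap trailing_pe);
}

my %stats = key_stats($response);
print "$_: $stats{$_}\n" for sort keys %stats;
```

Because fields are looked up by name rather than by position in the page, a cosmetic change to the site's HTML would not affect code like this.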
