PerlMonks  

Foreach Array and Html table extract

by doctordoctor (Initiate)
on Aug 14, 2012 at 18:13 UTC (#987434)
doctordoctor has asked for the wisdom of the Perl Monks concerning the following question:

I'll preface this post by saying I haven't used Perl in a few years, and back then only for an undergraduate web programming class. Currently I am trying to parse data out of SEC financial statements, where it appears in tables. I'll need to search through 200k+ documents and want a way to loop through a list of URLs quickly. I can set up the array of URLs correctly, but using HTML::TableExtract on each item of the array is proving troublesome. Any suggestions would be greatly appreciated. Code is below:

#!/usr/bin/perl
use 5.014; # so push/pop/etc work on scalars (experimental)
use strict;
use warnings;
use LWP::Simple 'get';
use HTML::TableExtract;

my $file = 'C:\Payout Policy Paper\Data\urllist.csv';
open (FH, "< $file") or die "Can't open $file for read: $!";
my @lines = <FH>;
close FH or die "Cannot close $file: $!";
print @lines;

foreach $line (@lines) {
    my $te = HTML::TableExtract->new(
        headers => [ 'Purchased','Average','Publicly','May'],
        slice_columns => 0,keep_html => 0,br_translate => 0 );
    $te->parse($line);
    my $table = $te->first_table_found;
    use Data::Dump;
    dd $_ for $table->rows;
}

I receive the error: Global symbol "$line" requires explicit package name at c:perlscripts\test.pl

Update 1: Thank you both for the quick responses, that moves me further along in the code, but the error I now receive is "Can't call method "rows" on an undefined value at C:\perlscripts\test.pl line 31"

Re: Foreach Array and Html table extract
by kcott (Abbot) on Aug 14, 2012 at 18:28 UTC
    I receive the error: Global symbol "$line" requires explicit package name at c:perlscripts\test.pl

    Adding a my to your foreach statement should fix this problem:

    foreach my $line (@lines) {

    -- Ken
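To illustrate the fix in isolation, here is a minimal sketch of a loop with a lexically declared loop variable under strict. The URLs are made-up placeholders, not entries from the original list.

```perl
use strict;
use warnings;

# Placeholder data standing in for the lines read from the URL list.
my @lines = ("http://example.com/a.htm\n", "http://example.com/b.htm\n");

my @seen;
foreach my $line (@lines) {    # 'my' declares $line, which satisfies 'use strict'
    chomp $line;               # $line aliases the array element, so @lines is chomped in place
    push @seen, $line;
}

print "$_\n" for @seen;
```

Because the loop variable is an alias, the chomp also strips the newline from the elements of @lines themselves.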

Re: Foreach Array and Html table extract
by davido (Archbishop) on Aug 14, 2012 at 18:32 UTC

    The specific error message you're getting is because you don't actually declare $line. You could do that with foreach my $line ( @lines ) {....

    At some point you'll probably want to set up a parallel user agent. It's likely that your biggest bottleneck will be in fetching the documents, otherwise.

    If you ask a dozen individuals how to implement a parallel user agent you'll probably get a dozen different answers. Some will include explicit use of fork or threads, while others might recommend a module that works well for them. I've used both LWP::Parallel::UserAgent, and Mojolicious's built-in Mojo::UserAgent. I think a lot more ongoing work and maintenance has gone into the latter, and since I use Mojolicious for other purposes anyway (and as it can be installed in under a minute), I lean toward the Mojo::UserAgent approach nowadays. Mojo::UserAgent combined with Mojo::IOLoop (an event loop) and Mojo::DOM (HTML/XHTML DOM parser with CSS selector support) is a powerful ally.


    Dave
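A rough sketch of the Mojo::UserAgent route, assuming a reasonably recent Mojolicious that provides get_p and Mojo::Promise (that promise interface is newer than this thread; the callback style with Mojo::IOLoop would work as well). fetch_all is a hypothetical helper name, and nothing here limits concurrency, which a run over 200k+ documents would probably need.

```perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::Promise;

# Hypothetical helper: fetch all URLs concurrently and return a hash ref
# of url => response body (undef where the request failed).
sub fetch_all {
    my @urls = @_;
    my $ua   = Mojo::UserAgent->new;
    my %body;

    my @promises = map {
        my $url = $_;
        $ua->get_p($url)->then(
            sub { my ($tx)  = @_; $body{$url} = $tx->result->body },
            sub { my ($err) = @_; warn "$url: $err\n"; $body{$url} = undef },
        );
    } @urls;

    # Run the event loop until every request has finished.
    Mojo::Promise->all(@promises)->wait if @promises;
    return \%body;
}

# Only touch the network when URLs are given on the command line.
if (@ARGV) {
    my $body = fetch_all(@ARGV);
    printf "%s: %d bytes\n", $_, length($body->{$_} // '') for sort keys %$body;
}
```

Each body could then be handed to whatever parser is preferred, whether Mojo::DOM or HTML::TableExtract.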

      Thank you for the quick response, that moves me further along in the code, but the error I now receive is "Can't call method "rows" on an undefined value at C:\perlscripts\test.pl line 31"

        Why don't you provide an updated snippet of code for us to play with, and some sample HTML that results in the error. Just wrap the HTML in code tags. It's easier to debug an error that we can easily reproduce.

        Also, you may want to use the 'debug' method from HTML::TableExtract to inspect the assertions your code makes about the state of affairs immediately before the call to 'rows'.


        Dave
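One way to end up with that message: first_table_found returns undef when no table in the parsed text matches the requested headers, and calling rows on undef then dies. Below is a small sketch of guarding against that, using made-up HTML rather than a real filing; rows_for_headers is a hypothetical helper name.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Made-up sample table; real filings are far messier.
my $html = <<'HTML';
<table>
  <tr><td>Period</td><td>Total Shares Purchased</td><td>Average Price Paid</td></tr>
  <tr><td>May</td><td>100</td><td>12.50</td></tr>
</table>
HTML

# Hypothetical helper: return an array ref of rows from the first table
# whose header row matches all of @headers, or undef if nothing matched.
sub rows_for_headers {
    my ($html, @headers) = @_;
    my $te = HTML::TableExtract->new( headers => \@headers );
    $te->parse($html);
    my $table = $te->first_table_found;
    return undef unless defined $table;    # no match: do not call ->rows
    return [ $table->rows ];
}

for my $headers ( [ 'Purchased', 'Average' ], [ 'Dividend' ] ) {
    my $rows = rows_for_headers($html, @$headers);
    if (defined $rows) {
        print join(' | ', map { $_ // '' } @$_), "\n" for @$rows;
    }
    else {
        warn "No table with headers (@$headers) found\n";
    }
}
```

The same check inside the original loop would at least show which inputs produce no matching table.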

Re: Foreach Array and Html table extract
by Cristoforo (Deacon) on Aug 14, 2012 at 20:55 UTC
    my @lines = <FH>;

    Judging by your code, I think the contents of @lines are comma-separated URLs. I don't see anywhere that you fetch the HTML content from any of those URLs. I think you want to retrieve each web page using LWP::Simple and then parse the required columns from that page's table(s).

    Chris

    Update: looking at the sample web page http://www.sec.gov/Archives/edgar/data/826083/000082608312000011/dellq1fy1310q.htm , not sure which tables you want to parse. It looks like a tricky parse.
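A sketch of that fetch-then-parse order, assuming each line of the list holds one or more comma-separated URLs (a guess about the file layout). urls_from_line and fetch_html are hypothetical helper names. LWP::Simple's get returns undef on failure, so the result is checked before use.

```perl
use 5.014;    # for the /r substitution flag
use strict;
use warnings;
use LWP::Simple 'get';

# Split one line of the list into trimmed, non-empty URLs.
sub urls_from_line {
    my ($line) = @_;
    chomp $line;
    return grep { length } map { s/^\s+|\s+$//gr } split /,/, $line;
}

# Fetch a page; returns undef (with a warning) if the download failed.
sub fetch_html {
    my ($url) = @_;
    my $html = get($url);
    warn "Could not fetch $url\n" unless defined $html;
    return $html;
}

# Only touch the network when a URL-list file is given on the command line.
if (@ARGV) {
    open my $fh, '<', $ARGV[0] or die "Can't open $ARGV[0] for read: $!";
    while (my $line = <$fh>) {
        for my $url (urls_from_line($line)) {
            my $html = fetch_html($url) // next;
            printf "%s: %d bytes\n", $url, length $html;
            # $html (not $url) is what would be passed to $te->parse(...)
        }
    }
    close $fh;
}
```

The key point is that the parser receives the page content returned by get, never the URL string itself.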
