PerlMonks  

Table scraping

by mcoblentz (Scribe)
on Nov 01, 2013 at 04:55 UTC
mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

Trying to simply grab the job postings from a corporate careers section and am getting into an unusual table construct. I don't quite see how to get the table rows to come back.

I am getting a basic error: Can't call method "rows" on an undefined value at /opt/local/lib/perl5/site_perl/5.12.4/HTML/TableExtract.pm line 237.  

I have tried dumping the table but I don't understand the results from Dumper.

If I can get as far as listing out the job postings I'll be in good shape. However, at this juncture I'm stumped. The code I'm using is below:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use WWW::Mechanize;
    use Data::Dumper;
    use HTML::TableExtract;
    use XML::FeedPP;
    use UTF8;

    # initialize
    my $cols;
    my $url;
    my $depth;
    my $count;
    my $data;
    my $tracking_code;
    my $location;
    my $job_title;
    my $date_posted;
    my $out_fh;

    # get the data from the web.  Typically this is:
    # https://commvault.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch#
    # Either pass this in as --url <page_url> when invoking or just set it.
    $cols = 'tracking_code,job_title,location,date_posted';
    $url = "https://commvault.silkroad.com/epostings/index.cfm?fuseaction=app.jobsearch";

    my $input;
    my $directory = "/Users/coblem/testing/";
    my $outfile = "cvlt_jobs.csv";

    open( $out_fh, '>', $ directory . $outfile)
        or die("Unable to create output file \"$out_fh\": $!\n");

    my $m = WWW::Mechanize->new();
    $m->get($url);
    $input = $m->content;

    my $te;
    if ( defined ($cols)) {
        print ("columns ", $cols, "\n");
        my @headers = split(/,/, $cols);
    #   $te = HTML::TableExtract->new( attribs => { border => 1 } );
        $te = HTML::TableExtract->new( headers => [qw( tracking_code job_title location date_posted )] ) or die qq{$!};
        print Dumper($te);
    } else {
        $te = new HTML::TableExtract( depth => $depth, count=>$count);
    };

    $te->parse($input);
    foreach my $row ($te->rows) {
        $tracking_code = $ { $row }[0];
        $job_title = $ { $row }[1];
        $location = $ { $row }[2];
        $date_posted = $ { $row }[3];
        print "positions: $tracking_code $job_title $location $date_posted \n";
    }

The page source HTML seems straightforward enough - it has a table definition in it and looks like this:

    <tbody><tr class="cssSearchResultsColHead">
        <td align="center" class="cssSearchResultsColHead"><a id="header_trackingCode" href="index.cfm?fuseaction=app.jobsearch&amp;newsort=1&amp;tcorder=asc&amp;thiscol=TRACKINGCODE&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes">Tracking Code</a></td>
        <td align="center" class="cssSearchResultsColHead"><a id="header_jobTitle" href="index.cfm?fuseaction=app.jobsearch&amp;newsort=1&amp;jtorder=asc&amp;thiscol=job_title&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes">Job Title</a></td>
        <td align="center" class="cssSearchResultsColHead"><a id="header_location" href="index.cfm?fuseaction=app.jobsearch&amp;newsort=1&amp;lorder=asc&amp;thiscol=location&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes">Location</a></td>
        <td align="center" class="cssSearchResultsColHead"><a id="header_datePosted" href="index.cfm?fuseaction=app.jobsearch&amp;newsort=1&amp;dporder=asc&amp;thiscol=postingdate&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes">Date Posted</a></td>
    </tr>
    <tr class="cssSearchResultsHighlight">
        <td align="center" class="cssSearchResultsBody">306145-636</td>
        <td align="left" class="cssSearchResultsBody"><a id="jobTitle_306145" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=306145&amp;source=ONLINE&amp;JobOwner=1013826&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes" class="cssSearchResultsBody">Sales Account Manager - Enterprise</a></td>
        <td align="left" class="cssSearchResultsBody">Seattle, Washington, United States</td>
        <td align="center" class="cssSearchResultsBody">10/31/2013</td>
    </tr>
    <tr class="cssSearchResultsLowlight">
        <td align="center" class="cssSearchResultsBody">306144-636</td>
        <td align="left" class="cssSearchResultsBody"><a id="jobTitle_306144" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=306144&amp;source=ONLINE&amp;JobOwner=1013767&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes" class="cssSearchResultsBody">Inside Sales Administrator</a></td>
        <td align="left" class="cssSearchResultsBody">Madrid, Madrid, Spain</td>
        <td align="center" class="cssSearchResultsBody">10/30/2013</td>
    </tr>
    <tr class="cssSearchResultsHighlight">
        <td align="center" class="cssSearchResultsBody">306143-636</td>
        <td align="left" class="cssSearchResultsBody"><a id="jobTitle_306143" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=306143&amp;source=ONLINE&amp;JobOwner=1013767&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes" class="cssSearchResultsBody">Inside Sales Administrator</a></td>
        <td align="left" class="cssSearchResultsBody">Milano, Lombardia, Italy</td>
        <td align="center" class="cssSearchResultsBody">10/30/2013</td>
    </tr>
    <tr class="cssSearchResultsLowlight">
        <td align="center" class="cssSearchResultsBody">306134-636</td>
        <td align="left" class="cssSearchResultsBody"><a id="jobTitle_306134" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=306134&amp;source=ONLINE&amp;JobOwner=1013767&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes" class="cssSearchResultsBody">Senior Technical Consultant / Enterprise Solutions Architect</a></td>
        <td align="left" class="cssSearchResultsBody">Reading, West Berkshire, United Kingdom</td>
        <td align="center" class="cssSearchResultsBody">10/30/2013</td>
    </tr>
    <tr class="cssSearchResultsHighlight">
        <td align="center" class="cssSearchResultsBody">306142-636</td>
        <td align="left" class="cssSearchResultsBody"><a id="jobTitle_306142" href="index.cfm?fuseaction=app.jobinfo&amp;jobid=306142&amp;source=ONLINE&amp;JobOwner=1013697&amp;company_id=15636&amp;version=2&amp;byBusinessUnit=NULL&amp;bycountry=0&amp;bystate=0&amp;byRegion=&amp;bylocation=&amp;keywords=&amp;byCat=&amp;proximityCountry=&amp;postalCode=&amp;radiusDistance=&amp;isKilometers=&amp;tosearch=yes" class="cssSearchResultsBody">Product Manager - Database</a></td>
        <td align="left" class="cssSearchResultsBody">Oceanport, New Jersey, United States</td>
        <td align="center" class="cssSearchResultsBody">10/29/2013</td>
    </tr>
    </tbody>

Re: Table scraping
by keszler (Priest) on Nov 01, 2013 at 06:20 UTC
    The first row of the table I see at that URL does not match headers => [qw( tracking_code job_title location date_posted )].

    Replace it with headers => ['Tracking Code', 'Job Title', 'Location', 'Date Posted'] and the rest of your program works.
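
    A minimal sketch of that change, run against a trimmed-down stand-in for the posted table rather than the live page. The sample HTML, the @jobs array, and the "|" separators are assumptions made for illustration; only the headers list comes from the suggestion above.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Trimmed-down stand-in for the posted page source (hypothetical sample:
# most attributes and the long hrefs are omitted).
my $html = <<'HTML';
<table>
<tr><td><a href="#">Tracking Code</a></td><td><a href="#">Job Title</a></td>
    <td><a href="#">Location</a></td><td><a href="#">Date Posted</a></td></tr>
<tr><td>306145-636</td><td><a href="#">Sales Account Manager - Enterprise</a></td>
    <td>Seattle, Washington, United States</td><td>10/31/2013</td></tr>
<tr><td>306144-636</td><td><a href="#">Inside Sales Administrator</a></td>
    <td>Madrid, Madrid, Spain</td><td>10/30/2013</td></tr>
</table>
HTML

# Headers are matched against the visible text of the header cells,
# not against the variable-style names used in the original script.
my $te = HTML::TableExtract->new(
    headers => [ 'Tracking Code', 'Job Title', 'Location', 'Date Posted' ],
);
$te->parse($html);

# Collect each data row; columns come back in the order the headers were given.
my @jobs;
foreach my $row ( $te->rows ) {
    my ( $tracking_code, $job_title, $location, $date_posted ) = @$row;
    push @jobs, [ $tracking_code, $job_title, $location, $date_posted ];
    print "positions: $tracking_code | $job_title | $location | $date_posted\n";
}
```

    In the real script, $html would be replaced by $m->content from WWW::Mechanize, as in the original post.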

      Thank you! Ka-CHING! it works.
Re: Table scraping
by kcott (Abbot) on Nov 01, 2013 at 06:58 UTC

    G'day mcoblentz,

    "I am getting a basic error: Can't call method "rows" on an undefined value ..."

    The only code you show with that method (i.e. the point where the error occurs) is:

    foreach my $row ($te->rows) {

    Some lines earlier in your code you declare $te:

    my $te;

    At the point of declaration it will have "an undefined value". At the point where the error occurs it has "an undefined value". Track its value through the code between those two points to find where it's being directly assigned undef or its value is being changed to undef through a side-effect. You'll now have a narrow focus for your troubleshooting efforts.
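
    One way to narrow this down is sketched below. It rests on an assumption: because the error is reported inside TableExtract.pm rather than in the script, the undefined value may be the matched table rather than $te itself. The sample HTML and the $matched variable are hypothetical; the check asks the extractor how many tables matched before calling rows.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical two-row sample standing in for the real page.
my $html = '<table><tr><td>Tracking Code</td><td>Job Title</td></tr>'
         . '<tr><td>306145-636</td><td>Sales Account Manager</td></tr></table>';

# These header strings do not occur in the table, so no table is selected.
my $te = HTML::TableExtract->new( headers => [qw( tracking_code job_title )] );
$te->parse($html);

# tables() returns the table objects that matched the constraints.
my @tables  = $te->tables;
my $matched = scalar @tables;

if ( !$matched ) {
    # Without this guard, calling $te->rows here is where the script would fail.
    warn "No table matched the requested headers; check the header text.\n";
}
else {
    print join( ',', @$_ ), "\n" for $te->rows;
}
```

    With a guard like this the script reports a mismatch instead of dying inside the module.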

    "I have tried dumping the table but I don't understand the results from Dumper."

    You haven't shown this output, or even given a hint as to what part you don't understand, so I'm not sure what you think we can do about that. Perhaps the Data::Dumper documentation will help.

    Your code shows other print statements. At least one of those probably had some output: you don't show that either.

    I recommend you take a look at perlobj: Invoking Class Methods. Read the section "Indirect Object Syntax"; noting its first paragraph, which is all in bold, and includes the text:

    "..., use of this syntax is discouraged as it can confuse the Perl interpreter. ..."

    Please follow this advice and change your code accordingly.
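
    A short sketch of that change, assuming the else branch of the original script is the line in question. The depth and count values are placeholders for illustration, not values taken from the post.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Indirect object syntax, as written in the original post (discouraged):
#   $te = new HTML::TableExtract( depth => $depth, count => $count );

# Direct method call on the class name. The values below are placeholders.
my ( $depth, $count ) = ( 0, 0 );
my $te = HTML::TableExtract->new( depth => $depth, count => $count );

print ref($te), "\n";
```

    The arrow form leaves no ambiguity about which token is the class and which is the method.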

    -- Ken
