PerlMonks  

problem HTML::FormatText::WithLinks::AndTables

by kevind0718 (Scribe)
on Mar 10, 2013 at 03:55 UTC ( #1022641=perlquestion )
kevind0718 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Wise Perl Monks:


Here I am again asking for your kind assistance.
For a home "non-commercial" project I am attempting to scrape data from here:
http://www.pro-football-reference.com/boxscores/
My code is below.
From the HTML returned by the website, I want to parse this table:

<table class="sortable stats_table float_left margin_right" id="game_info">
<tr class='thead'><th colspan=2>Game Info</th></tr>
<tr class="">
  <td align="" ><b>Stadium</b></td>
  <td align="" >Hubert H. Humphrey Metrodome (dome)</td>
</tr>
<tr class="">
  <td align="" ><b>Start Time</b></td>
  <td align="" >12:00pm</td>
</tr>
<tr class="">
  <td align="" ><b>Surface</b></td>
  <td align="" >astroturf</td>
</tr>
<tr class="">
  <td align="" ><b>Weather</b></td>
  <td align="" >72 degrees, no wind</td>
</tr>
<tr class="">
  <td align="" ><b>Vegas Line</b></td>
  <td align="" >San Francisco 49ers <a href='/play-index/tgl_finder.cgi?request=1&match=season&year_min=1985&year_max=1985&game_type=R&game_num_min=0&game_num_max=99&week_num_min=0&week_num_max=99&game_day_of_week=&game_time=&time_zone=&game_location=&game_result=&overtime=&league_id=&team_id=&opp_id=&conference_game=&division_game=&tm_is_playoff=&opp_is_playoff=&tm_is_winning=&opp_is_winning=&tm_scored_first=&tm_led=&tm_trailed=&c1stat=favored_by&c1comp=eq&c1val=11'>-11.0</a></td>
</tr>
<tr class="">
  <td align="" ><b>Over/Under</b></td>
  <td align="" >46.0 <b>(over)</b></td>
</tr>
</table>
When I do I get this error
Can't call method "content" on an undefined value at C:/Perl64/site/lib/HTML/FormatText/WithLinks/AndTables.pm line 217.
 at C:/Perl64/site/lib/HTML/FormatText/WithLinks/AndTables.pm line 217
    HTML::FormatText::WithLinks::AndTables::_format_tables('HTML::FormatText::WithLinks::AndTables=HASH(0x4325450)', 'HTML::TreeBuilder=HASH(0x4326a80)') called at C:/Perl64/site/lib/HTML/FormatText/WithLinks/AndTables.pm line 101
    HTML::FormatText::WithLinks::AndTables::parse('HTML::FormatText::WithLinks::AndTables=HASH(0x4325450)', '<table class="sortable stats_table float_left margin_right" ...') called at C:/Perl64/site/lib/HTML/FormatText/WithLinks/AndTables.pm line 83
    HTML::FormatText::WithLinks::AndTables::convert('HTML::FormatText::WithLinks::AndTables', '<table class="sortable stats_table float_left margin_right" ...') called at C:/Users/kbd0718/workspace/testPerl/testGetProFootballBox.pl line 82
I have gotten HTML::FormatText::WithLinks to work for a couple of other tables within websites, but in this case it fails. The Perl code in HTML::FormatText::WithLinks is beyond me; I cannot debug through it. I am hoping that one of you wise monks would take a crack at it and either tell me what I am doing wrong or suggest a bug fix.

Many thanks for your kind assistance.
KD
use strict;
use warnings;
use Data::Dumper;
use HTML::FormatText::WithLinks::AndTables;
use IO::File;
use LWP::Simple;

my %teamCodes;
$teamCodes{"ATL"} = "atl";      ## Atlanta Falcons
$teamCodes{"CHI"} = "chi";      ## Chicago Bears
$teamCodes{"CIN"} = "cin";      ## Cincinnati Bengals
$teamCodes{"CLE"} = "cle";      ## Cleveland Browns
$teamCodes{"BUF"} = "buf";      ## Buffalo Bills
$teamCodes{"DAL"} = "dal";      ## Dallas Cowboys
$teamCodes{"DEN"} = "den";      ## Denver Broncos
$teamCodes{"DET"} = "det";      ## Detroit Lions
$teamCodes{"GNB"} = "gnb";      ## Green Bay Packers
$teamCodes{"HOO"} = "hoo|oti";  ## Houston Oilers
$teamCodes{"IND"} = "ind|clt";  ## Indianapolis Colts
$teamCodes{"NYJ"} = "nyj";      ## New York Jets
$teamCodes{"KAN"} = "kan";      ## Kansas City Chiefs
$teamCodes{"LAM"} = "lam|ram";  ## Los Angeles Rams
$teamCodes{"LAD"} = "lad|rai";  ## Los Angeles Raiders
$teamCodes{"MIA"} = "mia";      ## Miami Dolphins
$teamCodes{"MIN"} = "min";      ## Minnesota Vikings
$teamCodes{"NYG"} = "nyg";      ## New York Giants
$teamCodes{"NWE"} = "nwe";      ## New England Patriots
$teamCodes{"NOR"} = "nor";      ## New Orleans Saints
$teamCodes{"PHI"} = "phi";      ## Philadelphia Eagles
$teamCodes{"PIT"} = "pit";      ## Pittsburgh Steelers
$teamCodes{"SEA"} = "sea";      ## Seattle Seahawks
$teamCodes{"SDG"} = "sdg";      ## San Diego Chargers
$teamCodes{"SFO"} = "sfo";      ## San Francisco 49ers
$teamCodes{"SLC"} = "slc|crd";  ## St. Louis Cardinals
$teamCodes{"TAM"} = "tam";      ## Tampa Bay Buccaneers
$teamCodes{"WAS"} = "was";      ## Washington Redskins

my $date1 = "198509080";
my $date2 = "198509090";
my $tKey;
my $link;
my $abbriv;
my $urlBase = "http://www.pro-football-reference.com/boxscores/";
my $webPageText;
my @teamCode;
my $delimiter = quotemeta("|");
my $startGameInfo;
my $startGameInfoTbl;
my $endGameInfoTbl;
my $gameInfoTbl;

while ( ($tKey, $abbriv) = each %teamCodes) {
    @teamCode = split( /$delimiter/, $abbriv );
    print "$teamCode[0] \n";
}

while ( ($tKey, $abbriv) = each %teamCodes) {
    @teamCode = split( /$delimiter/, $abbriv );
    $link = $urlBase . $date1 . $teamCode[0] . ".htm";
    print $link;
    $webPageText = get( $link ) or print "failed on retrieve of web page\n";
    if (index( $webPageText, "File Not Found") > 0 ) {
        print " failed on retrieve of web page\n";
    }
    else {
        print "\n$webPageText\n\n";
        if ( $startGameInfo = index( $webPageText, "Game Info") ) {
            $startGameInfoTbl = rindex($webPageText, "<table class=", $startGameInfo );
            $endGameInfoTbl = index ( $webPageText, "</table>", $startGameInfo );
            $gameInfoTbl = substr($webPageText, $startGameInfoTbl, $endGameInfoTbl - $startGameInfoTbl +9);
            print $gameInfoTbl;
            my $converted = HTML::FormatText::WithLinks::AndTables->convert( $gameInfoTbl );
            my @lines = split /\n+/, $converted;
            my $arraySize = @lines;
            print "\narray size = $arraySize\n";
        }
    }
}
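
One detail of the script above is worth a sketch: index() and rindex() return -1 when nothing is found, and -1 is a true value in Perl, so a test like `if ( $startGameInfo = index(...) )` also succeeds for pages that have no "Game Info" text at all. Also, "</table>" is 8 characters long, so adding 9 takes one character too many. The following is only a sketch of the same substring approach with explicit checks; extract_table_around() is a hypothetical helper name and the sample page is made up for illustration. Whether this is related to the error reported above is not established.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the <table>...</table> block that contains
# $marker, or undef when the marker or either tag cannot be found.
sub extract_table_around {
    my ($html, $marker) = @_;

    my $marker_pos = index($html, $marker);
    return if $marker_pos < 0;    # index() returns -1 (a true value) on failure

    my $start = rindex($html, '<table', $marker_pos);
    my $end   = index($html, '</table>', $marker_pos);
    return if $start < 0 || $end < 0;

    # Include the closing tag itself, and nothing after it.
    return substr($html, $start, $end - $start + length('</table>'));
}

# Made-up sample page, standing in for the text returned by LWP::Simple::get().
my $page  = '<p>intro</p><table id="game_info"><tr><th>Game Info</th></tr></table><p>end</p>';
my $table = extract_table_around($page, 'Game Info');
print defined $table ? "$table\n" : "no Game Info table found\n";
```

Checking `defined` on the result before handing it to a converter or parser keeps a missing table from turning into a confusing failure further down.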

Re: problem HTML::FormatText::WithLinks::AndTables
by tobyink (Abbot) on Mar 10, 2013 at 06:57 UTC

    I have a feeling that HTML::FormatText::WithLinks::AndTables is not the right module for this task. What you're doing is converting HTML to plain text and then trying to parse that plain text. It would be easier to parse the original HTML. It's like taking cheese, tomato and pepperoni, assembling them into a sandwich, then disassembling the sandwich to make a pizza. Why make the sandwich when you want pizza?

    What you want is an HTML parsing module. I'll give you an example using HTML::HTML5::Parser, because I wrote it. There are plenty of other HTML parsers on CPAN though.

    use strict;
    use warnings;
    use HTML::HTML5::Parser;
    use XML::LibXML::QuerySelector;
    use Data::Dumper;

    my $url = "http://www.pro-football-reference.com/boxscores/198509080ram.htm";

    my %data = HTML::HTML5::Parser
        -> load_html(location => $url)
        -> querySelectorAll('table#game_info tr')     # get all rows from game_info table
        -> grep(sub { not $_->{class} eq 'thead' })   # ignore class="thead" row
        -> map(sub {                                  # map each row into a key, value pair
            my ($key, $value) = $_->querySelectorAll('td');
            return $key->textContent => $value->textContent;
        });

    print Dumper \%data;

    This outputs...

    $VAR1 = {
              'Start Time' => '1:00pm',
              'Over/Under' => '38.0 (under)',
              'Surface' => 'grass',
              'Vegas Line' => 'Pick',
              'Weather' => '69 degrees, relative humidity 62%, wind 12 mph',
              'Stadium' => 'Anaheim Stadium'
            };
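
    For comparison, here is a rough sketch of the same idea with HTML::TreeBuilder, one of the other HTML parsers on CPAN (it already shows up in the stack trace in the question, so it is presumably installed). The sample markup is a trimmed copy of the table from the question, and game_info() is a hypothetical helper name, not part of any module.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample markup modelled on the game_info table quoted in the question.
my $html = <<'HTML';
<table id="game_info">
<tr class='thead'><th colspan=2>Game Info</th></tr>
<tr class=""><td><b>Stadium</b></td><td>Hubert H. Humphrey Metrodome (dome)</td></tr>
<tr class=""><td><b>Surface</b></td><td>astroturf</td></tr>
</table>
HTML

# Hypothetical helper: map the first cell of each two-cell row to the second.
sub game_info {
    my ($markup) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($markup);
    my $table = $tree->look_down(_tag => 'table', id => 'game_info');
    my %info;
    if ($table) {
        for my $row ($table->look_down(_tag => 'tr')) {
            my @cells = $row->look_down(_tag => 'td');
            next unless @cells == 2;    # skips the header row, which has only a <th>
            $info{ $cells[0]->as_text } = $cells[1]->as_text;
        }
    }
    $tree->delete;                      # free the parse tree
    return %info;
}

my %info = game_info($html);
print "$_: $info{$_}\n" for sort keys %info;
```

    Either way, the data comes straight out of the parsed tree, with no intermediate plain-text step to pick apart.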
    package Cow { use Moo; has name => (is => 'lazy', default => sub { 'Mooington' }) } say Cow->new->name
      Thanks for taking an interest in my issue.
      I chose HTML::FormatText::WithLinks because it holds on to links within the web page. On other pages that I am parsing, I need to be able to follow links on the current page to retrieve additional data.
      I took a look at the CPAN page for HTML::HTML5::Parser but did not see any mention of what it does with links.

      Please advise.

      Best

      KD

        HTML::HTML5::Parser parses the HTML into a DOM tree. It preserves all elements and all attributes. (The example I gave earlier showed filtering by the class="thead".)

        Once the HTML is parsed, it's returned as an XML::LibXML::Document object, so you can manipulate it through object-oriented programming using more or less the same DOM API supported by desktop web browsers such as Internet Explorer, Firefox, Chrome, etc., just using Perl instead of JavaScript.

        For example:

        // Javascript
        var links = document.getElementsByTagName('a');
        for (var i = 0; i < links.length; i++) {
            alert(links[i].href);
        }

        # Perl
        my $document = HTML::HTML5::Parser->load_html(location => $url);
        my @links = $document->getElementsByTagName('a');
        for (my $i = 0; $i < @links; $i++) {
            warn($links[$i]{href});
        }

        The majority of HTML parsing modules work along the same lines.
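
        As an illustration of "the same lines" with a different parser, this sketch collects links with HTML::TreeBuilder and turns the site-relative hrefs (like the one in the Vegas Line cell) into absolute URLs with the URI module, so they can be fetched afterwards. absolute_links() is a hypothetical helper name, the base URL is an assumed example built the way the script in the question builds its links, and the href is shortened for readability.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical helper: return every href found in $markup as an absolute URL.
sub absolute_links {
    my ($markup, $base) = @_;
    my $tree  = HTML::TreeBuilder->new_from_content($markup);
    my @links = map  { URI->new_abs($_, $base)->as_string }   # resolve against the page URL
                grep { defined }                              # skip <a> elements without href
                map  { $_->attr('href') }
                $tree->look_down(_tag => 'a');
    $tree->delete;                                            # free the parse tree
    return @links;
}

# Assumed page URL and a shortened copy of the link from the question's table.
my $base    = 'http://www.pro-football-reference.com/boxscores/198509080min.htm';
my $snippet = q{<p>San Francisco 49ers <a href='/play-index/tgl_finder.cgi?request=1'>-11.0</a></p>};

print "$_\n" for absolute_links($snippet, $base);
```

        Each returned string can then be passed to get() from LWP::Simple to retrieve the linked page.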


Node Type: perlquestion [id://1022641]
Approved by kcott