http://www.perlmonks.org?node_id=1022649


in reply to problem HTML::FormatText::WithLinks::AndTables

I have a feeling that HTML::FormatText::WithLinks::AndTables is not the right module for this task. What you're doing is converting HTML to plain text and then trying to parse that plain text. It would be easier to parse the original HTML. It's like taking cheese, tomato and pepperoni, assembling them into a sandwich, then disassembling the sandwich to make a pizza. Why make the sandwich when you want pizza?

What you want is an HTML parsing module. I'll give you an example using HTML::HTML5::Parser, because I wrote it. There are plenty of other HTML parsers on CPAN though.

use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML::QuerySelector;
use Data::Dumper;

my $url = "http://www.pro-football-reference.com/boxscores/198509080ram.htm";

my %data = HTML::HTML5::Parser
   -> load_html(location => $url)
   -> querySelectorAll('table#game_info tr')    # get all rows from game_info table
   -> grep(sub { not $_->{class} eq 'thead' })  # ignore class="thead" row
   -> map(sub {                                 # map each row into a key, value pair
        my ($key, $value) = $_->querySelectorAll('td');
        return $key->textContent => $value->textContent;
      });

print Dumper \%data;

This outputs...

$VAR1 = {
          'Start Time' => '1:00pm',
          'Over/Under' => '38.0 (under)',
          'Surface' => 'grass',
          'Vegas Line' => 'Pick',
          'Weather' => '69 degrees, relative humidity 62%, wind 12 mph',
          'Stadium' => 'Anaheim Stadium'
        };

Re^2: problem HTML::FormatText::WithLinks::AndTables
by kevind0718 (Scribe) on Mar 10, 2013 at 17:11 UTC
    Thanks for taking an interest in my issue.
    I chose HTML::FormatText::WithLinks because it holds on to links within the web page. On other pages that I am parsing, I need to be able to follow links on the current page to retrieve additional data.
    I took a look at the CPAN page for HTML::HTML5::Parser, but I do not see any mention of what it does with links.

    Please advise.

    Best

    KD

      HTML::HTML5::Parser parses the HTML into a DOM tree. It preserves all elements and all attributes. (The example I gave earlier showed filtering on the class="thead" attribute.)

      Once the HTML is parsed, it's returned as an XML::LibXML::Document object, so you can manipulate it through object-oriented programming using more or less the same DOM API supported by desktop web browsers such as Internet Explorer, Firefox, Chrome, etc. Just using Perl instead of Javascript.

      For example:

      // Javascript
      var links = document.getElementsByTagName('a');
      for (var i = 0; i < links.length; i++) {
          alert(links[i].href);
      }
      # Perl
      my $document = HTML::HTML5::Parser->load_html(location => $url);
      my @links = $document->getElementsByTagName('a');
      for (my $i = 0; $i < @links; $i++) {
          warn($links[$i]{href});
      }

      The majority of HTML parsing modules work along the same lines.
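
      For comparison, here is a rough, untested sketch of the same link walk using Mojo::DOM from the Mojolicious distribution (an assumption on my part that you have it installed; it is not one of the modules discussed above). The method names differ, but the overall shape is the same: parse, select, read attributes.

      # Sketch only: the same idea with Mojo::DOM instead of HTML::HTML5::Parser.
      use strict;
      use warnings;
      use Mojo::UserAgent;

      my $url = 'http://www.pro-football-reference.com/boxscores/198509080ram.htm';

      # Fetch the page and parse the response body into a Mojo::DOM tree.
      my $dom = Mojo::UserAgent->new->get($url)->res->dom;

      # Walk every <a> element and report its href attribute.
      for my $link ($dom->find('a')->each) {
          warn $link->attr('href');
      }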

        Hello:

        Thanks for that example. It gave me something to think about.

        The first thing I noticed is that you are using:
        -> querySelectorAll('table#game_info tr')    # get all rows from game_info table
        to find the game table. Not all websites provide an id on the table.
        Please consider the following webpage:
        http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl
        What I want to extract from here is the basic game info, week by week.
        When I look at the source for the page, I do not see an id.
        Anyway, I wrote this bit of code:
        use strict;
        use warnings;
        use HTML::HTML5::Parser;
        use XML::LibXML::QuerySelector;
        use XML::LibXML;
        use Data::Dumper;

        my $url = "http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl";

        my $parser = HTML::HTML5::Parser->new;
        my $doc = $parser->parse_file($url);

        print Dumper $doc;
        print $doc->toString;
        just to see what HTML::HTML5::Parser would do with the databasefootball.com season page.
        I expected the webpage would get parsed into some sort of XML structure I could query, but I do not see that.
        Hope I am not testing your patience, but how would I get at the tables of scores by week in the above webpage?
        Many thanks

        KD
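
One direction for that page, sketched here only as a possibility and assuming its markup still lays the weekly scores out in ordinary tables with no id attributes: select every table on the page and walk each one's rows with the same querySelectorAll calls used earlier, then pick out the rows that hold game data.

use strict;
use warnings;
use HTML::HTML5::Parser;
use XML::LibXML::QuerySelector;

my $url = "http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl";
my $doc = HTML::HTML5::Parser->load_html(location => $url);

# No id to hang a selector on, so take every table and dump each row's cells.
for my $table ($doc->querySelectorAll('table')) {
    for my $row ($table->querySelectorAll('tr')) {
        my @cells = map { $_->textContent } $row->querySelectorAll('td');
        print join(' | ', @cells), "\n" if @cells;
    }
}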