Re^2: problem HTML::FormatText::WithLinks::AndTables

by kevind0718 (Scribe)
on Mar 10, 2013 at 17:11 UTC


in reply to Re: problem HTML::FormatText::WithLinks::AndTables
in thread problem HTML::FormatText::WithLinks::AndTables

Thanks for taking an interest in my issue.
I chose HTML::FormatText::WithLinks because it holds on to links within the web page. On other pages that I am parsing I need to be able to follow links on the current page to retrieve additional data.
I took a look at the CPAN page for HTML::HTML5::Parser, but I do not see any mention of what it does with links.

please advise.

Best

KD


Re^3: problem HTML::FormatText::WithLinks::AndTables
by tobyink (Abbot) on Mar 10, 2013 at 22:08 UTC

    HTML::HTML5::Parser parses the HTML into a DOM tree. It preserves all elements and all attributes. (The example I gave earlier showed filtering by the class="thead".)

    Once the HTML is parsed, it's returned as an XML::LibXML::Document object, so you can manipulate it through object-oriented programming using more or less the same DOM API supported by desktop web browsers such as Internet Explorer, Firefox, Chrome, etc., just using Perl instead of JavaScript.

    For example:

    // Javascript
    var links = document.getElementsByTagName('a');
    for (var i = 0; i < links.length; i++) {
        alert(links[i].href);
    }
    # Perl
    my $document = HTML::HTML5::Parser->load_html(location => $url);
    my @links    = $document->getElementsByTagName('a');
    for (my $i = 0; $i < @links; $i++) {
        warn($links[$i]->getAttribute('href'));
    }
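
    The C-style counter loop mirrors the JavaScript; in everyday Perl you would more often write the same thing as a foreach (the same DOM calls, just a stylistic variant):

    # more idiomatic Perl loop over the same node list
    for my $link (@links) {
        warn $link->getAttribute('href');
    }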

    The majority of HTML parsing modules work along the same lines.

      Hello:

      Thanks for that example. It gave me something to think about.

      The first thing I noticed is that you are using:
      -> querySelectorAll('table#game_info tr') # get all rows from game_info table
      to find the game table. Not all websites provide an id on the table.
      Please consider the following webpage:
      http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl
      What I want to extract from here is the basic game info week by week.
      When I look at the source for the page I do not see an id.
      Anyway I wrote this bit of code:
      use strict;
      use warnings;
      use HTML::HTML5::Parser;
      use XML::LibXML::QuerySelector;
      use XML::LibXML;
      use Data::Dumper;

      my $url    = "http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl";
      my $parser = HTML::HTML5::Parser->new;
      my $doc    = $parser->parse_file($url);

      print Dumper $doc;
      print $doc->toString;
      just to see what HTML::HTML5::Parser would do with the databasefootball.com season page.
      I expected the webpage would get parsed into some sort of XML structure I could query, but I do not see that.
      Hope I am not testing your patience, but how would I get at the tables of scores by week in the above webpage?
      Many thanks

      KD

        Dumper is not especially useful for inspecting XML::LibXML's objects. You see, XML::LibXML is a wrapper for a C library (libxml2) and the real guts of the objects live within the C library.

        This has frustrated me in the past too. The HTML5 spec differentiates between two kinds of xml:lang attributes (attributes called lang in the xml namespace, versus attributes called xml:lang in no namespace!) and toString doesn't distinguish between those. So this was tricky to debug when working on HTML::HTML5::Parser.

        I wrote XML::LibXML::Debugging as a solution, though I rarely use it these days, and don't give it much attention maintenance-wise. The following example gives you a big Perlish tree of nested hashes and arrays:

        use strict;
        use warnings;
        use Data::Dumper;
        use HTML::HTML5::Parser;
        use XML::LibXML::Debugging;

        my $document = HTML::HTML5::Parser->load_html(IO => \*DATA);
        print Dumper( $document->toDebuggingHash );

        __DATA__
        <!doctype html>
        <title lang="en">Example</title>
        <table><tr><td xml:lang="en">Hello world</table>

        However, the best ways of navigating the XML tree are to use querySelector/querySelectorAll, provided by XML::LibXML::QuerySelector (which allow you to choose elements using CSS selectors), or, if you need something more powerful, XML::LibXML's built-in XPath support.
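
        For instance, a minimal XPath sketch (assuming, as HTML5 parsing normally implies, that the parsed elements end up in the XHTML namespace, so a prefix has to be registered before the query; the h prefix here is an arbitrary choice):

        use XML::LibXML::XPathContext;

        my $xpc = XML::LibXML::XPathContext->new($document);
        $xpc->registerNs( h => 'http://www.w3.org/1999/xhtml' );

        # every <table> element in the document, in document order
        my @tables = $xpc->findnodes('//h:table');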

        You don't always need an id attribute to select the data you want. For example, to select the third <table> on a page, you could just do:

        my @all_tables   = $document->querySelectorAll('table');
        my $wanted_table = $all_tables[2];

        Or to select the first <table> within <div class="foo">:

        my @all_tables   = $document->querySelectorAll('div.foo table');
        my $wanted_table = $all_tables[0];

        Or, because querySelector returns the first match, this is the same:

        my $wanted_table = $document->querySelector('div.foo table');
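
        Once you have the table element, a rough sketch of pulling the cell text out of it might look like this (assuming the data sits in ordinary <tr>/<td> rows; adjust the selectors to the real markup):

        # walk each row of the selected table and print its cells, tab-separated
        for my $row ($wanted_table->querySelectorAll('tr')) {
            my @cells = map { $_->textContent } $row->querySelectorAll('td');
            next unless @cells;    # skip rows with no <td> cells (e.g. header rows)
            print join("\t", @cells), "\n";
        }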
