PerlMonks  

problem HTML::FormatText::WithLinks::AndTables

by kevind0718 (Scribe)
on Dec 07, 2012 at 16:21 UTC (#1007777=perlquestion)
kevind0718 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Kind Monks:

I hope you will have time for my dilemma. I need to parse HTML tables, like the data found at this link: http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl
So I installed the module HTML::FormatText::WithLinks::AndTables, which came really close to doing what I need.
After downloading the source for the above page, I ran it through a bit of code and got the following:
    1985 NFL Season Scores, Schedules and Playoffs
    [1]1986 1985 [2]1984
    Week 1
    Sunday, September 8
    [1]IND 3 at [1]PIT 45 [1]BOX

    1. /teams/teamyear.htm?tm=IND&lg=NFL&yr=1985
    1. /teams/teamyear.htm?tm=PIT&lg=NFL&yr=1985
    1. /boxscores/gamedata.htm?dy=8&mth=9&yr=1985&tm=PIT&lg=NFL

    [1]SDG 14 at [1]BUF 9 [1]BOX

    1. /teams/teamyear.htm?tm=SDG&lg=NFL&yr=1985
    1. /teams/teamyear.htm?tm=BUF&lg=NFL&yr=1985
    1. /boxscores/gamedata.htm?dy=8&mth=9&yr=1985&tm=BUF&lg=NFL

    [1]DEN 16 at [1]LAM 20 [1]BOX

    1. /teams/teamyear.htm?tm=DEN&lg=NFL&yr=1985
    1. /teams/teamyear.htm?tm=LAM&lg=NFL&yr=1985
    1. /boxscores/gamedata.htm?dy=8&mth=9&yr=1985&tm=LAM&lg=NFL

    [1]PHI 0 at [1]NYG 21 [1]BOX

    1. /teams/teamyear.htm?tm=PHI&lg=NFL&yr=1985
    1. /teams/teamyear.htm?tm=NYG&lg=NFL&yr=1985
    1. /boxscores/gamedata.htm?dy=8&mth=9&yr=1985&tm=NYG&lg=NFL

    [1]SLC 27 at [1]CLE 24 OT [1]BOX

This is close to what I need. The problem is that the URLs are stripped out into "footnotes", but they are all numbered [1]. That makes it a bit of a bummer to line up the table contents with the corresponding URLs.
It can be done, but I was expecting the footnotes to come through as [1], [2], [3].
Hoping somebody has more experience with this module than I do.

thanks in advance

KD

Re: problem HTML::FormatText::WithLinks::AndTables
by roboticus (Canon) on Dec 07, 2012 at 17:01 UTC

    kevind0718:

    You asked the same question last week, but got no replies. I'm not familiar with HTML::FormatText::WithLinks::AndTables, but after a cursory review of the documentation, I don't recall seeing anything mentioned about footnotes. Are you sure that the module has anything to do with your problem?

    You don't show any code, so it's hard to offer any help, as I don't know how footnotes ever come into the picture.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re: problem HTML::FormatText::WithLinks::AndTables
by DanEllison (Sexton) on Dec 07, 2012 at 17:57 UTC
    I don't have any experience with HTML::FormatText, but LWP and HTML::TreeBuilder are rather dear to my heart. Try out the following code; I think it will get you closer to what you want. I just put the links inline with the text they are associated with, but you could easily print a footnote number instead and push the links into an array.
        use strict;
        use LWP::UserAgent;
        use HTML::TreeBuilder;

        my $ua = LWP::UserAgent->new;
        my $page = $ua->get("http://www.databasefootball.com/boxscores/scheduleyear.htm?yr=1985&lg=nfl", 'User-Agent'=>'Mozilla/5.0');
        my $tree = HTML::TreeBuilder->new;
        $tree->parse_content($page->decoded_content);
        #$tree->dump;
        foreach my $table ($tree->look_down('_tag', 'table')) {
            print "###Table###\n";
            foreach my $row ($table->look_down('_tag', 'tr', sub { $_[0]->look_up('_tag', 'table') == $table; })) {
                if ($row->as_text =~ / at /) {
                    my $c = 0;
                    foreach my $col ($row->look_down('_tag', qr/^t[dh]$/, sub { $_[0]->look_up('_tag', 'tr') == $row; })) {
                        if ($c++) {
                            print " ";
                        }
                        printf "%s", $col->as_text;
                        if ($col->look_down('href', qr/./)) {
                            printf " [%s]", $col->look_down('href', qr/./)->attr('href');
                        }
                    }
                    print "\n";
                }
                else {
                    printf "%s\n", $row->as_text;
                }
            }
        }
        exit;
    When looking for rows and columns, I perform a look_up nested inside my look_down. This may not be an issue for this webpage, but I deal with a lot of nested tables on my websites, and this keeps a row that belongs to a nested table from being processed as part of the outer table.
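
    The footnote variant mentioned above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the sample markup and its shortened hrefs are hypothetical stand-ins for one row of the schedule page, format_rows is a made-up helper name, and the nested-table check is left out for brevity. Each link is pushed onto an array, and its position in that array becomes the footnote number, so the markers count up instead of all reading [1].

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup (shortened hrefs), standing in for one table row.
my $html = <<'HTML';
<table>
  <tr>
    <td><a href="/teams/teamyear.htm?tm=IND">IND</a></td>
    <td>3</td>
    <td>at</td>
    <td><a href="/teams/teamyear.htm?tm=PIT">PIT</a></td>
    <td>45</td>
  </tr>
</table>
HTML

# format_rows is a hypothetical helper: it returns the formatted rows and
# the list of links, where link N corresponds to the marker [N] in the text.
sub format_rows {
    my ($content) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($content);
    my (@lines, @links);
    foreach my $row ($tree->look_down('_tag', 'tr')) {
        my @cells;
        foreach my $col ($row->look_down('_tag', qr/^t[dh]$/)) {
            my $text = $col->as_text;
            # Number every link in the cell by its position in @links.
            foreach my $link ($col->look_down('_tag', 'a', 'href', qr/./)) {
                push @links, $link->attr('href');
                $text .= ' [' . scalar(@links) . ']';
            }
            push @cells, $text;
        }
        push @lines, join(' ', @cells);
    }
    $tree->delete;    # free the parse tree
    return (\@lines, \@links);
}

my ($lines, $links) = format_rows($html);
print "$_\n" foreach @$lines;
print "\n";
# Print the collected links as numbered footnotes.
printf "%d. %s\n", $_ + 1, $links->[$_] foreach 0 .. $#$links;
```

    Because the counter is simply the size of the array, the numbering keeps running across rows and tables. If you want it to restart for each table, empty the array (after printing its footnotes) at the top of the table loop.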
Re: problem HTML::FormatText::WithLinks::AndTables
by Khen1950fx (Canon) on Dec 07, 2012 at 22:45 UTC
    Try this:
        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple;
        use HTML::FormatText::WithLinks;

        my $html = get(
            "http://www.perlmonks.org/?node=Recently%20Active%20Threads"
        );
        my $f = HTML::FormatText::WithLinks->new(
            base => "http://www.perlmonks.org/",
            unique_links => 1,
            link_num_generator => \&generator,
            before_link => '[%n]',
            footnote => '%n est %l'
        );

        # Receives the current link number and returns the next one.
        sub generator {
            my $num = shift;
            $num += 1;
            return $num;
        }

        print $f->parse($html);
