HTML Table Extract

by mcoblentz (Scribe)
on Jul 08, 2013 at 01:47 UTC (#1043026=perlquestion)
mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Brothers (and Sisters!),

I am trying to extract both values and metadata from an HTML table. I've successfully used HTML::TableExtract to get at the values in the table. However, I have just realized that the metadata contains much of the data I really want, and that using it directly would save me considerable processing effort.

A typical row looks as follows:

</tr><tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N&nbsp;35&deg;41' E&nbsp;025&deg;41'</a></td><td>9HA2188</td></tr>
Specifically, <a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N&nbsp;35&deg;41' E&nbsp;025&deg;41'</a>
has more position data in the tag section. And I wouldn't have to convert it to anything.

The column headers for this are: Ship, Last reporting time, Position, and Callsign. While the name is a value item in the first column, the position data in the third column is more definitive in the href metadata section than the value itself.
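Once the href string is in hand, getting the coordinates out of it is the easy part. A minimal sketch using core Perl only; the helper name parse_position_href is hypothetical, and it assumes plain key=value pairs with no percent-encoding, as in the sample row:

```perl
use strict;
use warnings;

# Hypothetical helper: turn the query string of an href into a hash reference.
# Assumes simple key=value pairs without percent-encoding. Pairs may be
# separated by "&" or by "&amp;" (as they would be in undecoded HTML source).
sub parse_position_href {
    my ($href) = @_;
    my ($query) = $href =~ /\?(.*)\z/s
        or return;    # no query string at all
    my %param = map { my ($k, $v) = split /=/, $_, 2; ($k => $v) }
                split /&(?:amp;)?/, $query;
    return \%param;
}

my $href = 'shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200';
my $pos  = parse_position_href($href);
printf "lat=%s lon=%s\n", $pos->{lat}, $pos->{lon};
```

For anything beyond simple pairs like these, the CPAN URI module and its query_form method are probably a safer choice than hand-splitting.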

How would you suggest extracting the metadata from the Position column's href? I initially tried an XML parser (but the page is not well formed enough for that) and then some other HTML parsing, but I didn't get very far. Which module would the Monks recommend?

My code, which is in a bit of disrepair because I was trying different modules...

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use Data::Dumper;
use HTML::TableExtract;
use HTML::TreeBuilder::XPath;
use UTF8;

# initialize
my $cols;
my $url;
my $depth;
my $count;
my $data;
my $ship;
my @position;
my @name;
my $callsign;

# get the data from the web. Typically this is:
#
# Either pass this in as --url <page_url> when invoking or just set it.
$cols = 'Ship,last reported (UTC),position,Callsign';
$url = "";
my $input;

my $out_fn = 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv';
open(my $out_fh, '>', $out_fn) or die("Unable to create output file \"$out_fn\": $!\n");

my $m = WWW::Mechanize->new();
$m->get($url);
$input = $m->content;

my $te;
if ( defined ($cols)) {
    my @headers = split(/,/, $cols);
    $te = HTML::TableExtract->new( attribs => { border => 1 } );
    $te = HTML::TableExtract->new( headers => [qw( Ship position last Callsign )] ) or die qq{$!};
} else {
    $te = new HTML::TableExtract( depth => $depth, count=>$count);
};

$te->parse($input);

foreach my $ship ($te->rows) {

    # extract name from row data using XPath
    my $tree = HTML::TreeBuilder::XPath->new_from_content($te);
    my @name = $tree->findvalues('//shipposition');
    print $name[0], "\n";

    # extract position from row data using XPath
    my @position = $tree->findvalues('//shiplocations');
    print @position;

    # my $re = qr/([NS]?)\s*(\d+)(?:\D*)(\d*).*?,\s*([EW]?)\s*(\d+)(?:\D*)(\d*)/;
    # unless ($position =~ /$re/) {
    #     die "unable to parse position\n";
    # }
    # my $lat = $2 + $3/60;
    # my $long = $5 + $6/60;
    # if ($1 eq 'S') { $lat = -$lat; }
    # if ($4 eq 'W') { $long = -$long; }
    # return sprintf("%.2f,%.2f", $lat, $long);
    # $time = $ { $ship }[2];
    # $callsign = $ { $ship }[3];
    # print "positions: $position \n";
}
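For comparison, the commented-out degrees-and-minutes conversion above could be sketched roughly as follows. The helper name dm_to_decimal is hypothetical, and the sketch assumes the cell text has already been entity-decoded into the form N 35°41' E 025°41' (no comma between latitude and longitude, unlike the regex in the script):

```perl
use strict;
use warnings;

# Hypothetical helper: convert text such as "N 35°41' E 025°41'" into
# signed decimal degrees. Returns an empty list if the text does not match.
sub dm_to_decimal {
    my ($text) = @_;
    my ($ns, $lat_d, $lat_m, $ew, $lon_d, $lon_m) =
        $text =~ /([NS])\D*(\d+)\D+(\d+)\D+?([EW])\D*(\d+)\D+(\d+)/
        or return;
    my $lat = $lat_d + $lat_m / 60;
    my $lon = $lon_d + $lon_m / 60;
    $lat = -$lat if $ns eq 'S';    # southern latitudes are negative
    $lon = -$lon if $ew eq 'W';    # western longitudes are negative
    return ($lat, $lon);
}

# \xA0 is the decoded &nbsp; and \xB0 the decoded &deg; from the sample row.
my ($lat, $lon) = dm_to_decimal("N\xA035\xB041' E\xA0025\xB041'");
printf "%.4f,%.4f\n", $lat, $lon;
```

The visible text only resolves to whole minutes (35°41' is about 35.6833), while the href carries 35.6902, which is why the href is the better source.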

Replies are listed 'Best First'.
Re: HTML Table Extract
by Corion (Pope) on Jul 08, 2013 at 06:43 UTC

    In many such cases it helps to read the documentation of the modules you are using. For example, WWW::Mechanize documents the ->links method. Maybe you can work with this.
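A rough sketch of the link-based approach, assuming the page contains links shaped like the one in the question. The sample HTML and the temp-file trick (used so the sketch runs without network access) are illustrative only; find_all_links with url_regex is a WWW::Mechanize method, url_abs comes from WWW::Mechanize::Link, and query_form from URI:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use URI::file;
use WWW::Mechanize;

# Sample page built from the row in the question (illustrative only).
my $html = <<'HTML';
<html><body><table border="1">
<tr><th>Ship</th><th>Last reporting time</th><th>Position</th><th>Callsign</th></tr>
<tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N&nbsp;35&deg;41' E&nbsp;025&deg;41'</a></td><td>9HA2188</td></tr>
</table></body></html>
HTML

# Write the sample to a temporary .html file and load it via a file:// URL.
my ($fh, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$fh} $html;
close $fh or die "close: $!";

my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get(URI::file->new_abs($path)->as_string);

# Keep only the position links and read lat/lon from their query strings.
my @positions;
for my $link ($mech->find_all_links(url_regex => qr/shiplocations\.phtml/)) {
    my %q = $link->url_abs->query_form;
    push @positions, { lat => $q{lat}, lon => $q{lon} };
}
printf "%s,%s\n", $_->{lat}, $_->{lon} for @positions;
```

Against the real site, the temp file would be replaced by a plain $mech->get($url). Matching each position link back to its ship name would still need the table row, so this only covers the coordinates.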

    The much easier way is to contact the website owners and ask them for a feed or a database dump - most likely you can find an arrangement that is beneficial to both sides.

Re: HTML Table Extract
by Anonymous Monk on Jul 08, 2013 at 01:56 UTC
      Thanks for pointing that out, I will go ask for permission. They may have a feed.

Node Type: perlquestion [id://1043026]
Approved by ww