HTML Table Extract

mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

Brothers (and Sisters!),

I am trying to extract both values and metadata from an HTML table. I've successfully used HTML::TableExtract to get at the values in the table. However, I have just realized that the metadata (the href attributes of the links) contains much of the data I really want, and it would save me considerable processing effort to use it directly.

A typical row looks as follows:

</tr><tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N&nbsp;35&deg;41' E&nbsp;025&deg;41'</a></td><td>9HA2188</td></tr>

Specifically, in <a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N 35°41' E 025°41'</a> the href attribute carries more precise position data than the link text, and it is already in decimal degrees, so I wouldn't have to convert it to anything.

The column headers for this are: Ship, Last reporting time, Position, and Callsign. While the name is a value item in the first column, the position data in the third column is more definitive in the href metadata section than the value itself.

How would you suggest extracting the href metadata from the Position column? I initially tried an XML parser (but the page is not well-formed enough for that) and then some other HTML parsing, but didn't get very far. Which module would the Monks recommend?

My code, which is in a bit of disrepair because I was trying different modules...

#!/usr/bin/perl
use strict;
use warnings;

use WWW::Mechanize;
use Data::Dumper;
use HTML::TableExtract;
use HTML::TreeBuilder::XPath;
use utf8;

# initialize
my $cols;
my $url;
my $depth;
my $count;
my $data;
my $ship;
my @position;
my @name;
my $callsign;

# get the data from the web.  Typically this is:
# http://www.sailwx.info/shiptrack/cruiseships.phtml
# Either pass this in as --url <page_url> when invoking or just set it.

$cols = 'Ship,last reported (UTC),position,Callsign';
$url = "http://www.sailwx.info/shiptrack/cruiseships.phtml";

my $input;

my $out_fn = 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv';
open(my $out_fh, '>', $out_fn) or die("Unable to create output file \"$out_fn\": $!\n");

my $m = WWW::Mechanize->new();
$m->get($url);
$input = $m->content;

my $te;
if ( defined ($cols))
{
    my @headers = split(/,/, $cols);
    
    $te = HTML::TableExtract->new( attribs => { border => 1 } );
    $te = HTML::TableExtract->new(
                                headers => [qw( Ship position last Callsign )]
                                ) or die qq{$!};
}
else
{
    $te = new HTML::TableExtract( depth => $depth, count=>$count);
};
$te->parse($input);
foreach my $ship ($te->rows) {
    
    # extract name from row data using XPath
    my $tree = HTML::TreeBuilder::XPath->new_from_content($te);
    my @name = $tree->findvalues('//shipposition');

    print $name[0], "\n";
    
    # extract position from row data using XPath
    my @position = $tree->findvalues('//shiplocations');
    print @position;
    
    
    #    my $re = qr/([NS]?)\s*(\d+)(?:\D*)(\d*).*?,\s*([EW]?)\s*(\d+)(?:\D*)(\d*)/;
    #    unless ($position =~ /$re/) {
    #        die "unable to parse position\n";
    #            }
    #    my $lat = $2 + $3/60;
    #    my $long = $5 + $6/60;
    
    #   if ($1 eq 'S') { $lat = -$lat; }
    #   if ($4 eq 'W') { $long = -$long; }
    #   return sprintf("%.2f,%.2f", $lat, $long);
    
    #    $time       = $ { $ship }[2];
    #    $callsign   = $ { $ship }[3];
    #    print "positions: $position \n";
    }


Replies are listed 'Best First'.
Re: HTML Table Extract by Corion (Patriarch) on Jul 08, 2013 at 06:43 UTC
In many such cases it helps to read the documentation of the modules you are using. For example, WWW::Mechanize documents the `->links` method. Maybe you can work with this. The much easier way is to contact the website owners and ask them for a feed or a database dump; most likely you can find an arrangement that is beneficial for both sides.
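As a minimal sketch of working from the links rather than the cell text: the snippet below runs on the sample row from the question, not on the live site, and uses only core Perl so it can be run as-is. `query_params` and `extract_ships` are hypothetical helper names, and the regex matching assumes every row looks exactly like the sample. With WWW::Mechanize one would instead pass the `url` of each object returned by `->links` to the same query-string splitting.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample row copied from the question; a real page would hold many such rows.
my $html = <<'HTML';
<tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N&nbsp;35&deg;41' E&nbsp;025&deg;41'</a></td><td>9HA2188</td></tr>
HTML

# Hypothetical helper: split the query string of an href into a hash ref.
# Accepts both "&" and "&amp;" separators; values are NOT percent-decoded.
sub query_params {
    my ($href) = @_;
    my %param;
    if ( $href =~ /\?(.*)\z/s ) {
        for my $pair ( split /&(?:amp;)?/, $1 ) {
            my ( $key, $value ) = split /=/, $pair, 2;
            $param{$key} = $value if defined $value;
        }
    }
    return \%param;
}

# Hypothetical helper: for each <tr>, merge the query parameters of all
# links in that row, and keep the row if it yielded both lat and lon.
sub extract_ships {
    my ($page) = @_;
    my @ships;
    while ( $page =~ m{<tr>(.*?)</tr>}gs ) {
        my $row = $1;
        my %ship;
        while ( $row =~ /<a\s+href="([^"]*)"/g ) {
            my $p = query_params($1);
            @ship{ keys %$p } = values %$p;
        }
        push @ships, \%ship if defined $ship{lat} && defined $ship{lon};
    }
    return @ships;
}

printf "%s: %s, %s\n", $_->{call}, $_->{lat}, $_->{lon}
    for extract_ships($html);
```

Because the `lat` and `lon` parameters are already decimal degrees, no degrees-and-minutes conversion is needed. For anything beyond this fixed row layout, a real HTML parser plus the URI module's `query_form` would be a sturdier choice than regexes.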
Re: HTML Table Extract by Anonymous Monk on Jul 08, 2013 at 01:56 UTC
Yeah, http://www.sailwx.info/info/termsofuse.html says you can't "web scrape", so no help from me.
Re^2: HTML Table Extract by mcoblentz (Scribe) on Jul 08, 2013 at 16:14 UTC
Thanks for pointing that out, I will go ask for permission. They may have a feed.
