mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:
I am trying to extract both values and metadata from an HTML table. I've successfully used HTML::TableExtract to get at the values in the table. However, I have just realized that the metadata contains much of the data I really want, and using it directly would save me considerable processing effort.
A typical row looks as follows:
</tr><tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N 35°41' E 025°41'</a></td><td>9HA2188</td></tr>

Specifically, <a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N 35°41' E 025°41'</a> has more position data in the tag section. And I wouldn't have to convert it to anything.
The column headers for this are: Ship, Last reporting time, Position, and Callsign. While the name is a value item in the first column, the position data in the third column is more definitive in the href metadata section than the value itself.
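To make the target concrete, here is a minimal sketch of what the href carries, using only core Perl and the sample row quoted above. The helper name query_params is hypothetical, and the splitting is deliberately naive (no percent-decoding); it only shows that the decimal latitude and longitude can be read straight out of the query string once the href is in hand.

```perl
use strict;
use warnings;

# Sample row copied from the post above.
my $row = q{<tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N 35°41' E 025°41'</a></td><td>9HA2188</td></tr>};

# Hypothetical helper: turn the query string of an href into a hash ref.
# Assumes simple key=value pairs separated by "&" (or "&amp;").
sub query_params {
    my ($href) = @_;
    my %param;
    my ($query) = $href =~ /\?(.*)\z/s;
    return \%param unless defined $query;
    for my $pair (split /&(?:amp;)?/, $query) {
        my ($key, $value) = split /=/, $pair, 2;
        $param{$key} = $value;
    }
    return \%param;
}

# Pick out the href that points at shiplocations.phtml.
my ($href) = $row =~ /href="(shiplocations\.phtml\?[^"]*)"/;
die "no position link found\n" unless defined $href;

my $pos = query_params($href);
printf "%s,%s\n", $pos->{lat}, $pos->{lon};    # prints 35.6902,25.6857
```

A regex is fine for one known row like this, but it is not a substitute for a real HTML parser on the full page.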
How would you suggest extracting the metadata from the Position column? I initially tried an XML parser (but the page is not well formed enough for that) and then some other HTML parsing modules, but didn't get very far. Which module would the Monks recommend?
My code, which is in a bit of disrepair because I was trying different modules...
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use Data::Dumper;
use HTML::TableExtract;
use HTML::TreeBuilder::XPath;
use UTF8;

# initialize
my $cols;
my $url;
my $depth;
my $count;
my $data;
my $ship;
my @position;
my @name;
my $callsign;

# get the data from the web. Typically this is:
# http://www.sailwx.info/shiptrack/cruiseships.phtml
# Either pass this in as --url <page_url> when invoking or just set it.
$cols = 'Ship,last reported (UTC),position,Callsign';
$url = "http://www.sailwx.info/shiptrack/cruiseships.phtml";
my $input;
my $out_fn = 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv';
open(my $out_fh, '>', $out_fn) or die("Unable to create output file \"$out_fn\": $!\n");

my $m = WWW::Mechanize->new();
$m->get($url);
$input = $m->content;

my $te;
if ( defined ($cols)) {
    my @headers = split(/,/, $cols);
    $te = HTML::TableExtract->new( attribs => { border => 1 } );
    $te = HTML::TableExtract->new( headers => [qw( Ship position last Callsign )] ) or die qq{$!};
} else {
    $te = new HTML::TableExtract( depth => $depth, count=>$count);
};
$te->parse($input);

foreach my $ship ($te->rows) {
    # extract name from row data using XPath
    my $tree = HTML::TreeBuilder::XPath->new_from_content($te);
    my @name = $tree->findvalues('//shipposition');
    print $name[0], "\n";

    # extract position from row data using XPath
    my @position = $tree->findvalues('//shiplocations');
    print @position;

    # my $re = qr/([NS]?)\s*(\d+)(?:\D*)(\d*).*?,\s*([EW]?)\s*(\d+)(?:\D*)(\d*)/;
    # unless ($position =~ /$re/) {
    #     die "unable to parse position\n";
    # }
    # my $lat = $2 + $3/60;
    # my $long = $5 + $6/60;
    # if ($1 eq 'S') { $lat = -$lat; }
    # if ($4 eq 'W') { $long = -$long; }
    # return sprintf("%.2f,%.2f", $lat, $long);
    # $time = $ { $ship }[2];
    # $callsign = $ { $ship }[3];
    # print "positions: $position \n";
}
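For comparison, here is one possible way the XPath attempt above could be reshaped. It is a sketch under assumptions, not a tested fix for the live page. In XPath, '//shipposition' looks for an element named shipposition, whereas the data here sits in the href attribute of the a elements. The sketch parses the whole document once with HTML::TreeBuilder::XPath, walks the rows, and hands each href to URI's query_form. The table wrapper and header row in the sample are assumed, the column positions (td[1] to td[4]) are taken from the headers described in the post, and the hash keys are illustrative.

```perl
use strict;
use warnings;
use utf8;                        # the sample markup contains a degree sign
use HTML::TreeBuilder::XPath;
use URI;

# Sample markup modelled on the row quoted in the post.
# The <table> wrapper and header row are assumptions.
my $html = <<'HTML';
<table border="1">
<tr><th>Ship</th><th>last reported (UTC)</th><th>position</th><th>Callsign</th></tr>
<tr><td><a href="shipposition.phtml?call=9HA2188">Thomson Majesty</a></td><td>2013-Jul-07 2334</td><td><a href="shiplocations.phtml?lat=35.6902&lon=25.6857&radius=200">N 35°41' E 025°41'</a></td><td>9HA2188</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

my @ships;
for my $row ($tree->findnodes('//tr[td]')) {      # data rows only, skips the header row
    # The href attribute of the link in the third (position) cell.
    my $href = $row->findvalue('./td[3]/a/@href') or next;

    # query_form returns the query string as a key/value list.
    my %query = URI->new($href)->query_form;

    push @ships, {
        name     => $row->findvalue('./td[1]'),
        time     => $row->findvalue('./td[2]'),
        lat      => $query{lat},
        lon      => $query{lon},
        callsign => $row->findvalue('./td[4]'),
    };
}
$tree->delete;    # free the parse tree

printf "%s,%s,%s,%s\n", @{$_}{qw(name callsign lat lon)} for @ships;
```

On the real page the XPath would probably need narrowing to the right table, and the content fetched by WWW::Mechanize would be passed in place of the sample string.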
Replies are listed 'Best First'.
Re: HTML Table Extract
by Corion (Patriarch) on Jul 08, 2013 at 06:43 UTC
Re: HTML Table Extract
by Anonymous Monk on Jul 08, 2013 at 01:56 UTC
by mcoblentz (Scribe) on Jul 08, 2013 at 16:14 UTC