PerlMonks  

Parsing a large html with perl

by zesys (Novice)
on Jun 02, 2020 at 20:07 UTC ( #11117607=perlquestion )

zesys has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl monks, first time asking a question. My apologies in advance if my post does not meet the house rules.

So, I am trying to extract information from a large HTML file.

Example blocks in the html:
    ---
    <tr>
    <td class="confluenceTd">AL2H </td>
    <td class="confluenceTd">NANOMETRICSTITANSMA000357 </td>
    <td class="confluenceTd">2017-03-09T00:00:00.000Z </td>
    <td class="confluenceTd">2017-05-25T12:00:00.000Z </td>
    <td class="confluenceTd"><p><span class="image-wrap" style=""><img src="/download/attachments/49449087/true.png?versio$
    <td class="confluenceTd">48.38981 </td>
    <td class="confluenceTd">-123.48739 </td>
    <td class="confluenceTd">-29.0 </td>
    <td class="confluenceTd">&nbsp;</td>
    </tr>
    ---
    <tr>
    <td class="confluenceTd">BACAX </td>
    <td class="confluenceTd">RDIADCP600WH9339 </td>
    <td class="confluenceTd">2011-07-15T18:42:25.000Z </td>
    <td class="confluenceTd">2012-05-30T01:12:03.000Z </td>
    <td class="confluenceTd"><p><span class="image-wrap" style=""><img src="/download/attachments/49449087/true.png?versio$
    <td class="confluenceTd">48.316762 </td>
    <td class="confluenceTd">-126.050163 </td>
    <td class="confluenceTd">985.0 </td>
    <td class="confluenceTd">221.0 </td>
    </tr>
    ---

What my code does at the moment: Copy the first 4 lines of the first html block above, and print them with their meanings.

    locationCode: AL2H
    deviceCode: NANOMETRICSTITANSMA000357
    dateFrom: 2017-03-09T00:00:00.000Z
    dateTo: 2017-05-25T12:00:00.000Z

What I would like to achieve:

1. Do the same thing as above by looping through similar blocks.

2. Extract only blocks that have a sub-string "RDI" in their second line (eg., RDIADCP600WH9339 in the second block shown above).

I can try 2 if I can get help with 1.

Thank you.

My semi-working code is below. As you can see, I am storing the html page in a variable, $scrappy.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Term::ANSIColor qw(:constants);

    my $scrappy = `curl -s 'https://wiki.oceannetworks.ca/display/O2A/Available+Deployments' 2>&1`;
    my $lineX;
    my $count = 0;
    foreach $lineX ( split /\n/, $scrappy ) {
        if ( $lineX =~ /^\s*$/ ) {    # Skip white spaces or comment line
            next;
        }
        my @F     = split( " ", $lineX );
        my $mylen = length $lineX;
        if ( $mylen >= 2 ) {    # numeric comparison on the line length
            if (    ( $F[0] eq '<td' )
                and ( $F[-1] eq '</td>' )
                and ( $F[-1] ne '</p></td>' ) )
            {
                my @f = split />/, $F[1];
                $count++;
                if ( $count == 1 ) {
                    print "locationCode: $f[1]\n";
                }
                elsif ( $count == 2 ) {
                    print "deviceCode: $f[1]\n";
                }
                elsif ( $count == 3 ) {
                    print "dateFrom: $f[1]\n";
                }
                elsif ( $count == 4 ) {
                    print "dateTo: $f[1]\n";
                }
            }
        }
    }

Replies are listed 'Best First'.
Re: Parsing a large html with perl
by haukex (Bishop) on Jun 02, 2020 at 21:23 UTC

    Welcome to the Monastery, zesys!

    The top of that page says:

    The following is dynamic list of all of the deployments that have data. It is being pulled from the deployments web service using the URL https://data.oceannetworks.ca/api/deployments?method=get&token=[YOUR_TOKEN_HERE]

    Why don't you just use that API?

    Anyway, if you need to parse HTML, then don't use regular expressions. Here's an example with Mojo::DOM:

        use warnings;
        use strict;
        use Mojo::UserAgent;
        use Mojo::DOM;

        my $ua  = Mojo::UserAgent->new( max_redirects => 3 );
        my $dom = $ua->get(
            'https://wiki.oceannetworks.ca/display/O2A/Available+Deployments'
        )->result->dom;

        $dom->find('.confluenceTable tr')->each(sub {
            my $tr = shift;
            my ($locationCode, $deviceCode, $dateFrom, $dateTo) =
                map { $tr->find(".confluenceTd:nth-of-type($_)")
                         ->map('all_text')->join } 1..4;
            print "locationCode=$locationCode, deviceCode=$deviceCode, ",
                "dateFrom=$dateFrom, dateTo=$dateTo\n";
        });

      Thanks so much @haukex. I have added two lines of code to yours (had two questions), and problem solved!
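The thread does not show which two lines were added. As a minimal sketch of one way to keep only rows whose device code contains "RDI", assuming the table markup shown in the question: the helper name `rdi_rows` and the inline sample HTML are hypothetical, and this is not necessarily what the OP wrote.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Sample rows in the shape shown in the question (only the first four cells).
my $html = <<'HTML';
<table class="confluenceTable">
<tr><td class="confluenceTd">AL2H </td><td class="confluenceTd">NANOMETRICSTITANSMA000357 </td><td class="confluenceTd">2017-03-09T00:00:00.000Z </td><td class="confluenceTd">2017-05-25T12:00:00.000Z </td></tr>
<tr><td class="confluenceTd">BACAX </td><td class="confluenceTd">RDIADCP600WH9339 </td><td class="confluenceTd">2011-07-15T18:42:25.000Z </td><td class="confluenceTd">2012-05-30T01:12:03.000Z </td></tr>
</table>
HTML

# rdi_rows (hypothetical helper): returns one array ref per table row whose
# second cell (the device code) contains "RDI".
sub rdi_rows {
    my ($html) = @_;
    my @rows;
    Mojo::DOM->new($html)->find('.confluenceTable tr')->each(sub {
        my $tr = shift;
        # Collect the cell texts and trim surrounding whitespace.
        my @cells = map { my $t = $_->all_text; $t =~ s/^\s+|\s+$//g; $t }
                    $tr->find('td.confluenceTd')->each;
        return if @cells < 4;                # header or malformed row
        return unless $cells[1] =~ /RDI/;    # keep RDI devices only
        push @rows, [ @cells[ 0 .. 3 ] ];
    });
    return @rows;
}

printf "locationCode=%s, deviceCode=%s, dateFrom=%s, dateTo=%s\n", @$_
    for rdi_rows($html);
```

Inside the `each` callback, `return` behaves like `next` in an ordinary loop, so rows that fail a check are simply skipped.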

      Regarding the API, I use the service through client libraries written for Python almost every day. I just wanted to do things differently this time by using Perl, for which the organisation does not seem to have a client library.

      Thank you all for your prompt answers and suggestions!!

        You don't need them to provide a client library in Perl; writing your own is reasonably straightforward. The advantage of using their API is that, generally speaking, an API is less susceptible to change than a web page. A Super Search for "mojo api" will find results to get you started.

      OP, please do use the URL at https://wiki.oceannetworks.ca/display/O2A/API+Reference that haukex pointed out.
      • it's an HTTP::Tiny call away! (hopefully an https URL is available)
      • it's JSON!
      • you'll learn a lot and be glad you did

      Note:

      If you do it right, you could get a Perl client listed in there. Also, see if it'll accept the query string in a POST body; be sure to set the Content-Type header of the request to application/x-www-form-urlencoded. The reason is that a token sent in the URL of a GET request tends to get logged everywhere the URL is recorded, and over plain http it isn't protected in transit either. Sometimes endpoints will accept the same parameters via POST. Even if it's just http, sending the token via POST, if it's accepted, will at least keep URLs containing your token out of those logs.
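A rough sketch of both request styles with HTTP::Tiny and JSON::PP follows. The endpoint and the `method`/`token` parameters come from the URL quoted earlier in the thread; the helper name `fetch_deployments` and the `ONC_TOKEN` environment variable are hypothetical, and whether the service accepts POST at all is an open assumption that would need to be tested. HTTPS requests with HTTP::Tiny also need IO::Socket::SSL and Net::SSLeay installed.

```perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(decode_json);

# Endpoint taken from the URL quoted on the wiki page.
my $endpoint = 'https://data.oceannetworks.ca/api/deployments';

# fetch_deployments (hypothetical helper): $via is 'GET' or 'POST'.
# Returns the decoded JSON response as a Perl data structure.
sub fetch_deployments {
    my ( $token, $via ) = @_;
    my $http = HTTP::Tiny->new( timeout => 30 );
    my %form = ( method => 'get', token => $token );

    my $res;
    if ( $via eq 'POST' ) {
        # post_form URL-encodes the hash into the request body and sets
        # Content-Type to application/x-www-form-urlencoded.
        # Assumption: the service may or may not accept POST.
        $res = $http->post_form( $endpoint, \%form );
    }
    else {
        # Same parameters, but sent in the URL as a query string.
        my $query = $http->www_form_urlencode( \%form );
        $res = $http->get("$endpoint?$query");
    }
    die "Request failed: $res->{status} $res->{reason}\n"
        unless $res->{success};
    return decode_json( $res->{content} );
}

# Only contact the service when a token is supplied in the environment.
if ( my $token = $ENV{ONC_TOKEN} ) {
    my $data = fetch_deployments( $token, 'GET' );
    print "Decoded a ", ( ref $data || 'scalar' ), " from the response\n";
}
```

Run it as `ONC_TOKEN=... perl script.pl`; switching the second argument to `'POST'` is enough to find out whether the endpoint accepts a form body.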

      If you insist on parsing the HTML and it really is just a large simple table, take a look at HTML::TableExtract.
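For illustration, a small HTML::TableExtract sketch run against an inline sample in the shape shown in the question. The helper name `table_rows` is hypothetical, and the extractor is created without constraints, so it picks up every table in the input; on the real page you would probably want to narrow it down.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Sample rows in the shape shown in the question (only the first four cells).
my $html = <<'HTML';
<table class="confluenceTable">
<tr><td class="confluenceTd">AL2H </td><td class="confluenceTd">NANOMETRICSTITANSMA000357 </td><td class="confluenceTd">2017-03-09T00:00:00.000Z </td><td class="confluenceTd">2017-05-25T12:00:00.000Z </td></tr>
<tr><td class="confluenceTd">BACAX </td><td class="confluenceTd">RDIADCP600WH9339 </td><td class="confluenceTd">2011-07-15T18:42:25.000Z </td><td class="confluenceTd">2012-05-30T01:12:03.000Z </td></tr>
</table>
HTML

# table_rows (hypothetical helper): returns one array ref of trimmed cell
# texts per row, across all tables found in the HTML.
sub table_rows {
    my ($html) = @_;
    my $te = HTML::TableExtract->new;    # no constraints: match every table
    $te->parse($html);
    my @rows;
    for my $table ( $te->tables ) {
        for my $row ( $table->rows ) {
            # Empty cells can come back undefined; normalise and trim.
            my @cells = map { defined $_ ? $_ : '' } @$row;
            s/^\s+|\s+$//g for @cells;
            push @rows, \@cells;
        }
    }
    return @rows;
}

for my $row ( table_rows($html) ) {
    print "locationCode: $row->[0], deviceCode: $row->[1], ",
        "dateFrom: $row->[2], dateTo: $row->[3]\n";
}
```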

        Usually makes more sense to reply to OP if that is who you are addressing. Your advice assumes they have API access, which may not be the case. The Mojo solution provided can deal just as easily with a JSON response as the HTML.

        Thanks @perlfan. I will try your first suggestion. I admit, as a non-developer, I often find it a daunting task making sense of a JSON response.
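To make a JSON response less daunting, here is a small sketch using the core JSON::PP module. The sample JSON is made up for illustration: it assumes the response is an array of objects with the keys `locationCode`, `deviceCode`, `dateFrom` and `dateTo` (the labels used in the question), which may not match the real API output. The helper name `rdi_deployments` is hypothetical.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Made-up sample; the real response layout may differ.
my $json = <<'JSON';
[
  { "locationCode": "AL2H",  "deviceCode": "NANOMETRICSTITANSMA000357",
    "dateFrom": "2017-03-09T00:00:00.000Z", "dateTo": "2017-05-25T12:00:00.000Z" },
  { "locationCode": "BACAX", "deviceCode": "RDIADCP600WH9339",
    "dateFrom": "2011-07-15T18:42:25.000Z", "dateTo": "2012-05-30T01:12:03.000Z" }
]
JSON

# rdi_deployments (hypothetical helper): decode the JSON text and return the
# entries whose deviceCode contains "RDI".
sub rdi_deployments {
    my ($json_text) = @_;
    my $deployments = decode_json($json_text);  # JSON array -> array reference
    # Each JSON object becomes a hash reference.
    return grep { ( $_->{deviceCode} // '' ) =~ /RDI/ } @$deployments;
}

for my $d ( rdi_deployments($json) ) {
    print join( ', ',
        map { "$_=$d->{$_}" } qw(locationCode deviceCode dateFrom dateTo) ),
        "\n";
}
```

Once decoded, the response is just nested Perl arrays and hashes, so printing it with the core Data::Dumper module is a quick way to see its actual shape before writing any filtering code.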
Re: Parsing a large html with perl
by jo37 (Friar) on Jun 02, 2020 at 20:15 UTC