
Re: Weather warnings from

by wfsp (Abbot)
on May 27, 2010 at 19:39 UTC (#841960)

in reply to Weather warnings from

It might be worth separating the data extraction from the rest of the logic; that could make the flow easier to follow. I would also recommend using a parser rather than regexes, which can get tricky on HTML.

I was unable to find a page on the website that corresponded to your regexes, so I have taken a guess at what it might look like. If you could post a link to an actual page you're dealing with, we would have more to go on. For instance, the code below uses HTML::TokeParser::Simple to make a single pass over the tokens, extracting data as appropriate (it covers similar ground to the regex in your _extract_details method).

If the page is well structured it may be more appropriate to consider something like HTML::TreeBuilder which is more powerful and could simplify proceedings greatly.

#! /usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

{
    package Meteoalarm::Parser;
    use HTML::TokeParser::Simple;

    sub new {
        my $class = shift;
        my $content = shift;
        my $p = HTML::TokeParser::Simple->new(string => $content);
        my $self = {
            parser => $p,
        };
        bless($self, $class);
        return $self;
    }

    sub parse {
        my $self = shift;
        my (%data, $txt);

        my $t = $self->find_img() or return;

        $txt = $self->get_div_txt(q{info});
        ($data{from}, $data{until}) = $txt =~ /^valid from (.*)Until(.*)$/;

        $txt = $self->get_div_txt(q{info});
        ($data{type}, $data{level}) = $txt =~ /^(.*)Awareness Level: (.*)$/;

        $self->{data} = \%data;
        return 1;
    }

    sub find_img{
        my $self = shift;
        my $p = $self->{parser};
        while (my $t = $p->get_token){
            return $t if $t->is_start_tag(q{img});
        }
        return;
    }

    sub get_div_txt{
        my $self = shift;
        my $div_class = shift;
        my $p = $self->{parser};
        my $txt;
        while (my $t = $p->get_token){
            if (
                $t->is_start_tag(q{div}) and
                $t->get_attr(q{class}) and
                $t->get_attr(q{class}) eq $div_class
            ){
                $p->get_token;
                $txt = $p->get_phrase;
                return $txt;
            }
        }
        return;
    }

    sub get_data{
        my $self = shift;
        return $self->{data};
    }
}

# script
my $content = do{local $/; <DATA>};
my $mp = Meteoalarm::Parser->new($content);
while ($mp->parse){
    my $data = $mp->get_data;
    print Dumper $data;
}

__DATA__
<img src="my.jpeg">
<!-- possible stuff -->
<div class="info">
  <b>valid from</b> from date 1 <b>Until</b> until date 1
</div>
<div class="info">
  <b>type 1</b> Awareness Level: <b>awareness level 1</b>
</div>
<div class="text">
  text
</div>
<!-- possible stuff -->
<img src="my_other.jpeg">
<!-- possible stuff -->
<div class="info">
  <b>valid from</b> from date 2 <b>Until</b> until date 2
</div>
<div class="info">
  <b>type 2</b> Awareness Level: <b>awareness level 2</b>
</div>
<div class="text">
  text
</div>
<!-- and so on -->

output

$VAR1 = {
          'level' => 'awareness level 1',
          'until' => ' until date 1',
          'from' => 'from date 1 ',
          'type' => 'type 1 '
        };
$VAR1 = {
          'level' => 'awareness level 2',
          'until' => ' until date 2',
          'from' => 'from date 2 ',
          'type' => 'type 2 '
        };

update: added output
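
For comparison, here is a rough sketch of how the same guessed sample markup might be handled with HTML::TreeBuilder. It assumes every warning contributes exactly two div.info blocks in a fixed order (validity period first, then type and level). The sub name extract_warnings is made up for illustration, and the real page may well be structured differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
use HTML::TreeBuilder;

# extract_warnings is a hypothetical helper, not part of any module.
sub extract_warnings {
    my $html = shift;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # Text of every <div class="info">, in document order.
    my @info = map { $_->as_trimmed_text() }
        $tree->look_down(_tag => q{div}, class => q{info});
    $tree->delete;    # free the tree explicitly

    my @warnings;
    # Assumption: the divs come in pairs (validity, then type and level).
    while (my ($period, $kind) = splice @info, 0, 2) {
        last unless defined $kind;
        my %w;
        # \s* and the lazy captures trim the whitespace around each value.
        @w{qw(from until)} = $period =~ /^valid from\s*(.*?)\s*Until\s*(.*?)\s*$/;
        @w{qw(type level)} = $kind   =~ /^(.*?)\s*Awareness Level:\s*(.*?)\s*$/;
        push @warnings, \%w;
    }
    return \@warnings;
}

# Same guessed sample markup as in the code above.
my $html = <<'END_HTML';
<img src="my.jpeg">
<div class="info">
  <b>valid from</b> from date 1 <b>Until</b> until date 1
</div>
<div class="info">
  <b>type 1</b> Awareness Level: <b>awareness level 1</b>
</div>
<div class="text">text</div>
<img src="my_other.jpeg">
<div class="info">
  <b>valid from</b> from date 2 <b>Until</b> until date 2
</div>
<div class="info">
  <b>type 2</b> Awareness Level: <b>awareness level 2</b>
</div>
<div class="text">text</div>
END_HTML

print Dumper extract_warnings($html);
```

Because look_down returns all matching elements at once, there is no token loop to maintain. The price is that the pairing of the two divs is positional, so it would need adjusting if a warning can omit one of them.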

Replies are listed 'Best First'.
Re^2: Weather warnings from
by walto (Pilgrim) on May 27, 2010 at 20:23 UTC
    I never got a handle on the HTML::TokeParser module, so I tried to get the data with regexes. But you are right: this is not the proper way to do it.
    To get the warnings of all countries the script evaluates
    I changed your
    sub find_img{
        my @images;
        my $content = shift;
        my $p = HTML::TokeParser::Simple->new(string => $content);
        while (my $t = $p->get_token){
            push @images, $t if $t->is_start_tag(q{img});
        }
        return \@images;
    }
    because there are many images on the page. So again I would need some routine (regexp) to filter out unwanted images.
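
    One possible way to do that filtering is to test each img token's src attribute while collecting, rather than running a regex over the raw HTML. The following is only a sketch: the sub name find_warning_imgs and the sample markup are invented for illustration, and the Bilder/wf/ prefix is an assumption about where the warning icons live.

    ```perl
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    # Hypothetical helper: return the src of every <img> whose src matches $wanted.
    sub find_warning_imgs {
        my ($content, $wanted) = @_;
        my $p = HTML::TokeParser::Simple->new(string => $content);
        my @images;
        while (my $t = $p->get_token) {
            next unless $t->is_start_tag(q{img});
            my $src = $t->get_attr(q{src});
            # Images without a src attribute are skipped as well.
            push @images, $src if defined $src and $src =~ $wanted;
        }
        return \@images;
    }

    # Invented sample markup; the path prefix is an assumption.
    my $html = q{<img src="logo.gif"><img src="Bilder/wf/wf_23.jpg"><img alt="spacer">};
    my $imgs = find_warning_imgs($html, qr{^Bilder/wf/});
    print "$_\n" for @$imgs;
    ```

    The pattern is applied to a single attribute value, not to markup, so it stays simple even when the surrounding HTML changes.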
      Ok, from looking at the link we can simplify things enormously.

      The data we are after are in cells with class col1 or col2. We can loop over those and extract what we need. You will need to tweak as appropriate but hopefully it will give you the idea.

      #! /usr/bin/perl
      use strict;
      use warnings;
      use Data::Dumper;

      # meteoalarm.html is the source from the website
      open my $fh, q{<}, q{meteoalarm.html}
          or die qq{can't open file to read: $!\n};
      my $content = do{local $/; <$fh>};

      my $mp = Meteoalarm::Parser->new($content);
      my $data = $mp->parse;
      print Dumper $data;

      package Meteoalarm::Parser;
      use HTML::TreeBuilder;
      use Data::Dumper;

      sub new {
          my $class = shift;
          my $content = shift;
          my $p = HTML::TreeBuilder->new_from_content($content);
          my $self = {
              parser => $p,
          };
          bless($self, $class);
          return $self;
      }

      sub parse {
          my $self = shift;
          my $p = $self->{parser};
          my (%data);

          # one cell per country, marked with class col1 or col2
          my @cells = $p->look_down(_tag => q{td}, class => qr/^col[12]$/);
          for my $cell (@cells){
              # skip any cell that lacks the expected div
              my $div = $cell->look_down(_tag => q{div}) or next;
              my $id  = $div->id;
              my $alt = $div->attr(q{alt});
              # a warning icon is only present when a warning is active
              my $img = $div->look_down(_tag => q{img});
              my $src = $img?$img->attr(q{src}):q{};
              $data{$id}{fullname} = $alt;
              $data{$id}{warning} = $src;
          }
          return \%data;
      }
      output (extract)
      $VAR1 = {
                'UK' => {
                          'warning' => '',
                          'fullname' => 'United Kingdom'
                        },
                'CY' => {
                          'warning' => '',
                          'fullname' => 'Cyprus'
                        },
                'IE' => {
                          'warning' => 'Bilder/wf/wf_23.jpg',
                          'fullname' => 'Ireland'
                        },
                'IS' => {
                          'warning' => '',
                          'fullname' => 'Iceland'
                        },
                'NL' => {
                          'warning' => '',
                          'fullname' => 'Netherlands'
                        },
                'BE' => {
                          'warning' => '',
                          'fullname' => 'Belgium'
                        },
                'AT' => {
                          'warning' => 'Bilder/wf/wf_23.jpg',
                          'fullname' => 'Austria'
                        },
              };
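
      With the data in that shape, picking out the countries that currently have a warning is a plain hash filter. A small sketch, using an excerpt of the output above as input; the sub name countries_with_warnings is hypothetical.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: sorted country codes whose 'warning' entry is non-empty.
      sub countries_with_warnings {
          my $data = shift;
          my @ids = grep {
              defined $data->{$_}{warning} && length($data->{$_}{warning})
          } keys %$data;
          @ids = sort @ids;
          return @ids;
      }

      # Excerpt of the structure printed above.
      my $data = {
          UK => { warning => q{},                    fullname => 'United Kingdom' },
          IE => { warning => 'Bilder/wf/wf_23.jpg',  fullname => 'Ireland' },
          NL => { warning => q{},                    fullname => 'Netherlands' },
          AT => { warning => 'Bilder/wf/wf_23.jpg',  fullname => 'Austria' },
      };

      for my $id (countries_with_warnings($data)) {
          print "$id ($data->{$id}{fullname}): $data->{$id}{warning}\n";
      }
      # prints:
      # AT (Austria): Bilder/wf/wf_23.jpg
      # IE (Ireland): Bilder/wf/wf_23.jpg
      ```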
        Thanks wfsp for your very helpful posts. With your advice I was able to change the script. I updated the original post with the new code.
        The HTML for country and region warnings differs slightly, so I kept the original structure with different methods in subs.
Re^2: Weather warnings from
by StommePoes (Scribe) on Jun 03, 2010 at 07:58 UTC
