http://www.perlmonks.org?node_id=841961


in reply to Re: Weather warnings from www.meteoalarm.eu
in thread Weather warnings from www.meteoalarm.eu

I never got a handle on the HTML::TokeParser module, so I tried to get the data with a regexp. But you are right: this is not the proper way to do it.
To get the warnings of all countries, the script evaluates http://www.meteoalarm.eu/
I changed your

    sub find_img {
        my @images;
        my $content = shift;
        my $p = HTML::TokeParser::Simple->new(string => $content);
        while (my $t = $p->get_token) {
            push @images, $t if $t->is_start_tag(q{img});
        }
        return \@images;
    }

because there are many images on the page. So again I would need some routine (regexp) to filter out unwanted images.
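
A rough sketch of such a filter, under one assumption: the warning icons seem to be served from Bilder/wf/ with names like wf_23.jpg, so matching each src value against that path may be enough. The names `$WARNING_SRC` and `filter_warning_srcs` are made up for illustration, and the pattern is a guess that probably needs adjusting against the real page.

```perl
use strict;
use warnings;

# Hypothetical pattern: warning icons appear to live under Bilder/wf/
# and to be named wf_<digits>.jpg. Adjust it if the site differs.
my $WARNING_SRC = qr{(?:^|/)Bilder/wf/wf_\d+\.jpg$};

# Keep only the src values that look like warning icons.
# Undefined values (img tags without a src) are dropped.
sub filter_warning_srcs {
    my @srcs = @_;
    return grep { defined($_) && $_ =~ $WARNING_SRC } @srcs;
}

my @srcs = (
    'Bilder/logo.gif',        # made-up example of an unwanted image
    'Bilder/wf/wf_23.jpg',    # looks like a warning icon
    undef,                    # img tag without a src attribute
);

my @wanted = filter_warning_srcs(@srcs);
print "$_\n" for @wanted;
```

Inside find_img the same test could be applied to the src attribute of each img token before pushing it onto @images.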

Re^3: Weather warnings from www.meteoalarm.eu
by wfsp (Abbot) on May 28, 2010 at 08:50 UTC
    OK, from looking at the link we can simplify things enormously.

    The data we are after are in cells with class col1 or col2. We can loop over those and extract what we need. You will need to tweak as appropriate but hopefully it will give you the idea.

        #! /usr/bin/perl
        use strict;
        use warnings;
        use Data::Dumper;

        # meteoalarm.html is the source from the website
        open my $fh, q{<}, q{meteoalarm.html}
          or die qq{cant open file to read: $!\n};
        my $content = do{local $/; <$fh>};

        my $mp = Meteoalarm::Parser->new($content);
        my $data = $mp->parse;

        print Dumper $data;

        package Meteoalarm::Parser;
        use HTML::TreeBuilder;
        use Data::Dumper;

        sub new {
            my $class = shift;
            my $content = shift;
            my $p = HTML::TreeBuilder->new_from_content($content);
            my $self = {
                parser => $p,
            };
            bless($self, $class);
            return $self;
        }

        sub parse {
            my $self = shift;
            my $p = $self->{parser};
            my (%data);

            my @cells = $p->look_down(_tag => q{td}, class => qr/^col[12]$/);
            for my $cell (@cells){
                my $div = $cell->look_down(_tag => q{div});
                my $id  = $div->id;
                my $alt = $div->attr(q{alt});
                my $img = $div->look_down(_tag => q{img});
                my $src = $img?$img->attr(q{src}):q{};

                $data{$id}{fullname} = $alt;
                $data{$id}{warning}  = $src;
            }
            return \%data;
        }
    output (extract)
        $VAR1 = {
            'UK' => {
                'warning' => '',
                'fullname' => 'United Kingdom'
            },
            'CY' => {
                'warning' => '',
                'fullname' => 'Cyprus'
            },
            'IE' => {
                'warning' => 'Bilder/wf/wf_23.jpg',
                'fullname' => 'Ireland'
            },
            'IS' => {
                'warning' => '',
                'fullname' => 'Iceland'
            },
            'NL' => {
                'warning' => '',
                'fullname' => 'Netherlands'
            },
            'BE' => {
                'warning' => '',
                'fullname' => 'Belgium'
            },
            'AT' => {
                'warning' => 'Bilder/wf/wf_23.jpg',
                'fullname' => 'Austria'
            },
        };
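
    With the data in that shape, picking out the countries that currently show a warning is a short grep over the hash. A minimal sketch, assuming an empty 'warning' string means "no warning"; the sub name `countries_with_warning` is made up for illustration, and the sample hash is a subset of the extract above.

```perl
use strict;
use warnings;

# Sample data copied from the output extract (subset).
my $data = {
    UK => { warning => '',                    fullname => 'United Kingdom' },
    IE => { warning => 'Bilder/wf/wf_23.jpg', fullname => 'Ireland' },
    AT => { warning => 'Bilder/wf/wf_23.jpg', fullname => 'Austria' },
};

# Return the country ids whose 'warning' entry is non-empty, sorted by id.
# Assumption: an empty string means no warning image was found.
sub countries_with_warning {
    my $data = shift;
    my @with_warning = grep { length $data->{$_}{warning} } keys %$data;
    return sort @with_warning;
}

for my $id (countries_with_warning($data)) {
    printf "%s (%s): %s\n", $id, $data->{$id}{fullname}, $data->{$id}{warning};
}
```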
      Thanks wfsp for your very helpful posts. With your advice I was able to change the script. I updated the original code with the new one.
      The HTML for country and region warnings differs slightly, so I kept the original structure with a separate sub for each.