Re^5: Breaking The Rules II

by BrowserUk (Pope)
on Jul 03, 2007 at 21:18 UTC


in reply to Re^4: Breaking The Rules II
in thread Breaking The Rules II

Some monks do indeed argue that "you can parse HTML with a regex" (e.g. tye) but I've never seen any argue against using a parser.

You didn't read what I wrote. I never said "use regex to parse html". If I wanted to parse HTML, I would use HTML::Parser (which I mentioned above as a fine, pragmatic module). But if I want to extract some strings from amongst some other strings, I use a regex.

The page used by the linked example above went away many moons ago, so here is an up-to-date one. One of the web pages I visit frequently is the BBC Weather forecast.

Here is a script to extract and print a selection of the data that page contains:

#! perl -slw
use strict;
use LWP::Simple;

our $area ||= 1040;

my $url = "http://www.bbc.co.uk/weather/5day.shtml?id=$area";
my $html = get $url or die "Failed to get html: $!, $^E";

my $reTag = qr[< [^>]+ > \s*? ]x;

my( $state, $temp, $wDir, $wSpeed, $humid, $press, $updown, $vis ) = $html =~ m[
    Current\s+Nearest\s+Observations $reTag : $reTag+
    ( \S+ ) $reTag
    \s+ ( \d+ ) .+?
    ( [NSEW]+ ) \s+ \( ( \d+ ) .+?
    Relative\s+Humidity .+? : \s+ ( \d+ ) .+?
    Pressure .+? : \s+ ( \d+ ), \s+ ( \S+ ) , .+?
    Visibility .+? : \s+ ( [^<]+ )
]smx or warn 'Regex failed!' and getstore $url, 'weather.failed';

print <<EOP;
Sky: $state
Temp: $temp
Wind Direction: $wDir
Speed: $wSpeed
Humidity: $humid%
Pressure: $press $updown
Visibility: $vis
EOP

When run it produces output like this:

C:\test>getWeather.pl
Sky: cloudy
Temp: 16
Wind Direction: S
Speed: 5
Humidity: 77%
Pressure: 999 Rising
Visibility: Very good

C:\test>getWeather.pl -area=3200
Sky: cloudy
Temp: 17
Wind Direction: SW
Speed: 17
Humidity: 67%
Pressure: 999 Rising
Visibility: Very good
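To try the approach without fetching the live page, here is a minimal offline sketch that runs the same pattern against an inline string. The HTML fragment is invented for illustration and is only assumed to resemble the page's markup; `extract_observations` is a hypothetical helper name, not part of the script above.

```perl
#! perl
use strict;
use warnings;

# ASSUMPTION: this HTML fragment is made up for illustration; it is not
# the real BBC markup, only something with a similar shape.
my $html = <<'HTML';
<b>Current Nearest Observations</b>:<br />
<span class="state">cloudy</span>
16<sup>C</sup>
<span>SW (10 <abbr>mph</abbr>)</span><br />
<b>Relative Humidity (<abbr>&#37;</abbr>)</b>: 77,
<b>Pressure (<abbr>mB</abbr>)</b>: 999, Rising,
<b>Visibility</b>: Very good<br />
HTML

# One tag, plus (lazily) any whitespace that follows it.
my $reTag = qr[< [^>]+ > \s*? ]x;

# Hypothetical helper: returns the eight captured fields in list context,
# or an empty list if the pattern does not match.
sub extract_observations {
    my ($text) = @_;
    return $text =~ m[
        Current\s+Nearest\s+Observations $reTag : $reTag+
        ( \S+ ) $reTag
        \s+ ( \d+ ) .+?
        ( [NSEW]+ ) \s+ \( ( \d+ ) .+?
        Relative\s+Humidity .+? : \s+ ( \d+ ) .+?
        Pressure .+? : \s+ ( \d+ ), \s+ ( \S+ ) , .+?
        Visibility .+? : \s+ ( [^<]+ )
    ]smx;
}

my ( $state, $temp, $wDir, $wSpeed, $humid, $press, $updown, $vis )
    = extract_observations($html)
    or die "Regex failed!\n";

print "Sky: $state\n";
print "Temp: $temp\n";
print "Wind Direction: $wDir\n";
print "Speed: $wSpeed\n";
print "Humidity: $humid%\n";
print "Pressure: $press $updown\n";
print "Visibility: $vis\n";
```

The pattern never models the document structure: it anchors on literal labels ("Relative Humidity", "Pressure", "Visibility") and skips whatever lies between them with `.+?`. That makes it indifferent to most markup changes, and equally liable to fail outright if a label is reworded.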

Your mission, should you choose to accept it, is to reproduce that script using HTML::TokeParser::Simple and post it here.

If you took the time to learn perl well you'll not need to. And I say that with no limitations or caveats. :-)

That's simply not true. For one, I couldn't replace HTML::Parser with Perl code, because it's a Perl wrapper around an XS wrapper around 41k of intensely involved C code. Nor GD for similar reasons. Nor Time::HiRes. Nor... about 60 more modules, but that would just belabour the point.
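As a small illustration of what such a module adds (a sketch only, and not a claim about how Time::HiRes is implemented internally): the builtin `time` returns whole seconds, whereas `Time::HiRes::time` returns fractional seconds.

```perl
use strict;
use warnings;
use Time::HiRes ();    # import nothing, so the builtin time() stays visible

my $coarse = time();               # builtin: whole seconds since the epoch
my $fine   = Time::HiRes::time();  # module: floating-point seconds

printf "builtin time():      %d\n",   $coarse;
printf "Time::HiRes::time(): %.6f\n", $fine;
```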


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^6: Breaking The Rules II
by wfsp (Abbot) on Jul 04, 2007 at 08:43 UTC
#! perl
use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser::Simple;

our $area ||= 1040;
my $url = "http://www.bbc.co.uk/weather/5day.shtml?id=$area";
my $html = get $url or die "Failed to get html: $!, $^E";

my $p = HTML::TokeParser::Simple->new(\$html);

while (my $t = $p->get_tag('div')){
  last if $t->get_attr('class') and $t->get_attr('class') eq 'display';
}

my @data;
while (my $t = $p->get_token){
  last if $t->is_end_tag('div');
  push @data, $t->as_is if $t->is_text;
}

# raw data
#for (0..$#data){
#  print "$_: $data[$_]\n";
#}

my ($dir, $speed) = split(/\(/, $data[5]);

print <<EOP;
Sky: $data[2]
Temp: $data[3]
Wind Direction: $dir
Speed: $speed
Humidity$data[11]
Pressure$data[15]
Visibility$data[17]
EOP

__DATA__
Sky: cloudy
Temp: 14
Wind Direction: SW
Speed: 10
Humidity: 80,
Pressure: 1004, Rising,
Visibility: Very good

## raw data ##
0: Current Nearest Observations
1: :
2: cloudy
3: 14
4: C
5: SW (10
6: mph
7: )
8: Relative Humidity (
9: &#37;
10: )
11: : 80,
12: Pressure (
13: mB
14: )
15: : 1004, Rising,
16: Visibility
17: : Very good
