Re: Re: Re: Which loop should I use?

by tedrek (Pilgrim)
on Jul 31, 2003 at 19:16 UTC


in reply to Re: Re: Which loop should I use?
in thread Which loop should I use?

There's nothing wrong with your loop, but the HTML on the first page is different from the rest, so you're just not parsing it correctly. The first page doesn't have a - between the < and 'Next Chatter', which throws off your regex.
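One way to make a pattern tolerate both layouts is to mark the hyphen as optional. This is only a minimal sketch: it assumes the link text appears literally as `<-Next Chatter` on later pages and `<Next Chatter` on the first, and both the sample strings and the `$next_link` name are made up for illustration. The real markup may differ (it may use `&lt;`, for example).

```perl
use strict;
use warnings;

# Hypothetical pattern: '<', an optional '-', then the link text.
# The \s* parts allow for stray whitespace around the hyphen.
my $next_link = qr/<\s*-?\s*Next Chatter/;

# Made-up samples standing in for the two page layouts.
my @samples = ('<Next Chatter', '<-Next Chatter', '< - Next Chatter');

for my $s (@samples) {
    print "$s: ", ($s =~ $next_link ? "match" : "no match"), "\n";
}
```

The same idea applies to whatever regex actually locates the link: anything that differs between pages becomes optional rather than required.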

Well, I got interested in this problem, so here's a replacement :). I dropped HTML::Tree because I thought it would be nice to have the full text of the messages, and it was much simpler to grab it straight from the HTML. It should be trivial to run the message text back through HTML::Tree to strip the HTML. I had originally tried tackling this using the parse tree, but the code was twice as long and uglier, not to mention it didn't work :(. I created a get function that reads from local disk so I didn't have to hit the website whenever I wanted to test.
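Stripping the HTML from a captured message might look roughly like this. It is a sketch, not part of the script: `strip_html` is a hypothetical helper name, the sample fragment is invented, and it assumes the HTML::Tree distribution (which provides HTML::TreeBuilder) is installed.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;    # shipped with the HTML::Tree distribution

# Hypothetical helper: return the plain text of an HTML fragment.
sub strip_html {
    my ($fragment) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($fragment);
    my $text = $tree->as_text;    # text content with the tags removed
    $tree->delete;                # release the parse tree explicitly
    return $text;
}

# Invented message content, just to show the call.
print strip_html('I <b>really</b> like this poem'), "\n";
```

Each message's content field could be passed through such a helper just before printing.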

And on to the code!:

#!/usr/bin/perl
use warnings;
use strict;

my $start = 1;
my $end = 10;

for my $cnt ($start..$end) {
    print "<b>Current page count: $cnt</b><p>\n";
    my $uri = "http://www.allpoetry.com/chat/page=$cnt";
    my $html = get($uri);

    # retrieve the text and split into lines
    my @lines = split /[\r\n]+/, $html;

    # Now get into trouble for parsing HTML by hand
    # This skips through until the first chat message hopefully.
    while (@lines) {
        if ($lines[0] =~ m/^\<a href="javascript:t\('/) {
            last;
        } else {
            shift @lines;
        }
    }

    my @messages;
    while (@lines) {
        my $line = shift @lines;
        # get out after parsing all the messages from history
        # so we don't capture the current chatbox.
        last if $line =~ /^<\/font>/;

        #We use next because actions aren't grabbed properly
        # To handle them this needs to look for the line starting
        # with a <i> and no second <a href='...'>
        # There may be other messages this doesn't handle.
        next unless $line =~ s/^<a\shref="javascript:t\('
                               ([\d\w\s_]+)
                               '\)">\1<\/a>
                               <a\shref='\/poets\/\1'>:<\/a>//x;
        my $user = $1;
        next unless $line =~ s/^(.*?)
                               \((\d+\s+
                               (?:days|hours?|minutes|seconds)
                               \s+ago)\)
                               \s+(?:<br>|<p>)//x;
        push @messages, {user => $user, content => $1, delay => $2};
    }

    foreach (@messages) {
        print sprintf("%15s:\%s (\%s)<br>\n", $_->{user}, $_->{content}, $_->{delay});
    }
}
exit(0);

sub get {
    my $uri = shift;
    $uri =~ /(\d+)$/;
    my $number = $1;
    open my $html, "<", $number or die "Couldn't open $number: $!";
    local $/;
    my $ret = <$html>;
    close $html or die "Couldn't close $number: $!";
    return $ret;
}
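To see what the two substitutions do, here they are applied to a single line. The sample line is made up and shaped only to fit the patterns, so the site's real chat lines may look different. Because of the /x flag, the whitespace inside each pattern is ignored, which means the two links must sit directly next to each other in the input.

```perl
use strict;
use warnings;

# Made-up line shaped to fit the two patterns; real chat lines may differ.
my $line = q{<a href="javascript:t('bob')">bob</a>}
         . q{<a href='/poets/bob'>:</a>}
         . q{ nice poem (5 minutes ago) <br>};

# Step 1: strip the leading user links and remember the name.
$line =~ s/^<a\shref="javascript:t\('
           ([\d\w\s_]+)
           '\)">\1<\/a>
           <a\shref='\/poets\/\1'>:<\/a>//x
    or die "user links not found";
my $user = $1;

# Step 2: split what is left into the message text and its age.
$line =~ s/^(.*?)
           \((\d+\s+
           (?:days|hours?|minutes|seconds)
           \s+ago)\)
           \s+(?:<br>|<p>)//x
    or die "message body not found";
my ($content, $delay) = ($1, $2);

print "user=[$user] content=[$content] delay=[$delay]\n";
```

One thing this makes visible: only `hours?` has an optional "s", so a line ending in "(1 minute ago)" or "(1 day ago)" would not match the second pattern and would be skipped.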

Update: Everything that isn't struck out :)

Update2: Added a few line breaks so that no line of code wraps.

Replies are listed 'Best First'.
Re: Re: Re: Re: Which loop should I use?
by coldfingertips (Pilgrim) on Aug 01, 2003 at 06:44 UTC
    Thank you for the code rewrite and for your interest in this problem. When I ran your code I got a 500 Internal Server Error:
    syntax error at newparse.pl line 63, near "+}"
    Can't use global $1 in "my" at newparse.pl line 72, near "= $1"
    Can't use global $! in "my" at newparse.pl line 73, near "$number: $!"
    Can't use global $! in "my" at newparse.pl line 76, near "$number: $!"
    I also cannot use your script for anything but reference, because there are many things in your version that I don't understand (most of your script, actually). With HTML::Tree the code may not have been perfect, but I understood everything I was trying to do.

    I will definitely keep this script and I will try to work out the bugs. Thanks!

      The error you got was because one line got wrapped by PM; I've added a few line breaks so nothing wraps now. In your original code there were three things that jumped out at me. The first was a 'local $/;': you aren't doing any IO, so it has no effect. The second is your use of 'split "<br>",...'. You are using that on the text-only version, which will not have any HTML in it, so it just assigns the whole text to $lines[0]. You also split on "<br>" again later, which won't do anything because the original split would have consumed all of the <br>'s anyway. The third is your regex for getting $goodlines: it will only match against a string that hasn't been broken up into lines, which means your split has to fail for that regex to work. Anyway, hope you can get it to work, and if you have any questions feel free to ask.

      Tedrek
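The first two points above can be checked with a few lines of Perl. This is a small sketch using made-up text rather than the original script's data, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Text-only input: there is no "<br>" in it to split on.
my $text  = "first line\nsecond line\n";
my @lines = split /<br>/, $text;
print scalar(@lines), " element(s) after splitting on <br>\n";

# Splitting on line endings is what actually separates the lines.
my @by_line = split /[\r\n]+/, $text;
print scalar(@by_line), " element(s) after splitting on newlines\n";

# local $/ only changes how the readline operator reads a filehandle,
# so it matters here, where something is actually being read.
open my $fh, '<', \$text or die "Couldn't open in-memory file: $!";
my $slurped = do { local $/; <$fh> };
close $fh or die "Couldn't close in-memory file: $!";
print "slurped the whole text in one read\n" if $slurped eq $text;
```

With no "<br>" present, split returns the whole string as a single element, which is why everything ended up in $lines[0].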
