Regex Exercise

deprecated has asked for the wisdom of the Perl Monks concerning the following question:

Hello, Monks...

I am writing a small script to hashify an HTML table. The table is large, but completely homogeneous (thank goodness). So, without further ado, I give you the HTML:

<tr><td><b><a
href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td><td>&nbsp&nbsp&nbsp
<i>chinese input utility for X
</i></td><td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site
1</a> ]</td><td>
[ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>

So, for simplicity, I zapped the \r and \n characters that were lurking in there, and now I have one big brick of HTML (which I will spare all of you; nobody ever said HTML was pretty). Here is my code:

my @fields = split '<tr><td><b>', $input;
foreach my $field (@fields) {
  # what i really wanted to do was...
  # (undef, $names{$1}) =~ m// but that didnt work either
  # so I added the $foo and $bar.
  my ($foo, $bar) = $field =~
    m!^<a href=.*>(.*)</a></b></td><td>&nbsp{3}<i>(.*)</i>.*$!x;
  $names{$foo} = $bar;
  print "$foo == $bar\n";
  }

If I print $field I do get my HTML, so I know $field is okay... I think the problem is the regex. In fact, I'm 90% sure it's the regex. But where is it wrong, given the data? It looks fine to me.

Thanks
brother dep.

--
transcending "coolness" is what makes us cool.


Replies are listed 'Best First'.
Re: Regex Exercise by japhy (Canon) on Mar 16, 2001 at 21:44 UTC
The problem is that `&nbsp{3}` matches the string "&nbsppp", not "&nbsp&nbsp&nbsp".

`japhy` -- Perl and Regex Hacker
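
To see the difference concretely: a quantifier binds only to the single item right before it, so `{3}` applies to the `p` alone unless the entity is grouped. Here is a small check, offered as a sketch; the variable names are invented for illustration.

```perl
use strict;
use warnings;

my $html = '<td>&nbsp&nbsp&nbsp<i>chinese input utility for X</i></td>';

# {3} binds only to the preceding "p", so this pattern looks for "&nbsppp<i>".
my $ungrouped = $html =~ /&nbsp{3}<i>/ ? 1 : 0;

# A non-capturing group makes {3} repeat the whole entity.
my $grouped = $html =~ /(?:&nbsp){3}<i>/ ? 1 : 0;

# The ungrouped pattern does match the literal string "&nbsppp".
my $literal = '&nbsppp' =~ /^&nbsp{3}$/ ? 1 : 0;

print "ungrouped: $ungrouped, grouped: $grouped, literal: $literal\n";
```

On the sample row only the grouped form should match, which is consistent with the original regex returning nothing.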
Re: Regex Exercise by Malkavian (Friar) on Mar 16, 2001 at 21:54 UTC
Perhaps the line:

m!^<a href=[^>]+>([^<]+)</a></b></td><td>(?:&nbsp){3}<i>([^<]+)</i>.*$!x;

may help? (It's untested, and I'm not that great at regex either. :) )

Malk.

Update: Oops, forgot the capturing brackets; now added back in.
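
One thing worth checking when trying a pattern like this: under the /x modifier, unescaped whitespace inside the pattern is ignored, so the space in `<a href=` no longer matches a literal space. The sketch below is an assumption about how the pattern could be adapted, not tested code from the reply. It runs the pattern once as written and once with the space escaped, against the sample row with line breaks removed and the leading `<tr><td><b>` already split off. The names `@as_posted`, `$name`, and `$desc` are made up for the example.

```perl
use strict;
use warnings;

my $field = '<a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td>'
          . '<td>&nbsp&nbsp&nbsp<i>chinese input utility for X</i></td>'
          . '<td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 1</a> ]</td>'
          . '<td>[ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>';

# With /x the unescaped space in "<a href=" is ignored, so nothing is captured.
my @as_posted = $field =~
    m!^<a href=[^>]+>([^<]+)</a></b></td><td>(?:&nbsp){3}<i>([^<]+)</i>.*$!x;

# Escaping the space ("\ ") keeps it literal under /x.
my ($name, $desc) = $field =~
    m!^<a\ href=[^>]+>([^<]+)</a></b></td><td>(?:&nbsp){3}<i>([^<]+)</i>.*$!x;

my %names;
$names{$name} = $desc if defined $name;

print "as posted captured ", scalar(@as_posted), " fields\n";
print "$name == $desc\n" if defined $name;
```

Dropping the /x modifier altogether would be another way to keep the space literal, since this pattern has no comments or layout whitespace that need it.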
Re: Regex Exercise by gryphon (Abbot) on Mar 16, 2001 at 22:32 UTC
Greetings deprecated,

Well, this isn't the best or most compact regex in the world, but I've tested this, and it appears to work in the trials I've done. Give it a try. Someone with additional regex experience should be able to shorten my match string somewhat, I suspect.

use strict;

my $input = "<tr><td><b><a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td><td>&nbsp&nbsp&nbsp<i>chinese input utility for X</i></td><td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 1</a> ]</td><td>[ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>";
my %data;
my @fields = split '<tr><td><b>', $input;
shift @fields;
foreach my $field (@fields) {
  ($data{fileurl}, $data{filename}, $data{description}, $data{ftp1}, $data{ftp2}) = $field =~
    m#^<a href=(.*?)>(.*?)</a></b></td><td>&nbsp&nbsp&nbsp<i>(.*?)</i></td><td>\[ <a href=(.*?)>.*?</a> ]</td><td>\[ <a href=(.*?)>.*#;
  print "$2 == $3\n";
}

Yeah, I know. It's a bit clunky. Given additional known constants for your specific situation, you may be able to streamline this a bit better than me. Anyway, good luck!

-Gryphon
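
For what it's worth, here is one possible way to shorten the match and keep every row instead of overwriting a single hash on each pass. It is only a sketch under a few assumptions: that each row holds its links in the order shown in the sample (file link first, then FTP links), and that the description is the only `<i>` element. The hash name `%packages` and its layout are invented for illustration.

```perl
use strict;
use warnings;

my $input = '<tr><td><b><a href=i386/zh-xcin-2.3.04.tgz-long.html>zh-xcin-2.3.04.tgz</a></b></td>'
          . '<td>&nbsp&nbsp&nbsp<i>chinese input utility for X</i></td>'
          . '<td>[ <a href=ftp://ftp.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 1</a> ]</td>'
          . '<td>[ <a href=ftp://ftp1.usa.openbsd.org/pub/OpenBSD/2.8/packages/i386/zh-xcin-2.3.04.tgz>FTP Site 2</a> ]</td></tr>';

my %packages;    # hypothetical layout: filename => { fileurl, description, ftp => [...] }

for my $row (split /<tr>/, $input) {
    # Collect every href and its link text, in document order.
    my @links = $row =~ m{<a href=([^>]+)>([^<]*)</a>}g;
    next unless @links >= 2;    # skips the empty leading field from split

    my ($fileurl, $filename) = @links[0, 1];
    my ($description) = $row =~ m{<i>([^<]*)</i>};

    # The remaining entries are (url, label) pairs; keep only the urls.
    my @ftp = map { $links[$_] } grep { $_ % 2 == 0 } 2 .. $#links;

    $packages{$filename} = {
        fileurl     => $fileurl,
        description => $description,
        ftp         => \@ftp,
    };
}

for my $name (sort keys %packages) {
    print "$name == $packages{$name}{description}\n";
}
```

The negated character classes (`[^>]+`, `[^<]*`) stop at the next tag boundary, so there is no need to spell out the `</td><td>` scaffolding between the pieces.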

