http://www.perlmonks.org?node_id=78769

deryni has asked for the wisdom of the Perl Monks concerning the following question:

I ask for any and all monks whose knowledge of regular expressions exceeds my own to help me optimize this regex.
Optimize for speed or readability; the choice is yours. I am parsing computer-generated, badly formatted HTML that has a number of lines that look like this:
<tr>
<td width="0" align="center"><font face="Arial" size="2">29022</font> </td>
<td width="0" align="center"><font face="Arial" size="2">01</font> </td>
<td width="0" align="center"><font face="Arial" size="2">354</font> </td>
<td width="0" align="center"><font face="Arial" size="2">201</font> </td>
<td width="0" align="center"><font face="Arial" size="2">01</font> </td>
<td width="0" align="center"><font face="Arial" size="2">&nbsp;</font> </td>
<td align="center"><font face="Arial" size="2">INTRODUCTION TO FILM</font> </td>
<td align="center"><font face="Arial" size="2"> 3</font> </td>
<td align="center"><font face="Arial" size="2">MW5M7,8</font> </td>
<td align="center"><font face="Arial" size="2">MI-100</font> </td>
<td align="center"><font face="Arial" size="2">&nbsp;</font> </td>
</tr>
The first piece of data is always 5 digits, the second 2, the third 3, the fourth 3, the fifth 2, the sixth an &nbsp; or a single letter, the seventh an unspecified number of words, the eighth a single space followed by a digit. The ninth will always have 1 letter and then will either be another letter followed by a number or two numbers (possibly followed by another similar set, as shown above). The tenth will always be letters followed by a dash and then some numbers or letters, and finally the eleventh can be ignored.

I am currently using /^<tr.*?(\d\d\d\d\d).*?<td.*?>(\d\d).*?<td.*?>(\d\d\d).*?<td.*?>(\d\d\d).*?<td.*?>(\d\d).*?<td\.*?>(?:&).*?<td.*?>(\w+(?:(?:\s\w+)?)*).*?<td.*?>(\s\d).*?<td.*?>(\w(?:(?:[\w\d,])?)*).*?<td.*?>(\w(?:\(?:[\w\d-])?)*)/ to pull out the relevant pieces of information. Any advice would be appreciated.

-Etan

P.S. Oh and if possible some slight explanation would be useful. I understand the basics of regex but flags and oddities still escape me. Thanks in advance.

P.P.S. I had been checking for both the <td> and <font> tags but then read this node and realized that I was asking for too much, so I stopped asking for the font tags. Just thought I'd throw that in to put my thought processes up for inspection.

Replies are listed 'Best First'.
Re: Regex optimization
by Anonymous Monk on May 08, 2001 at 11:02 UTC
    If you're sure of the table format...
    while (              # use a loop to grab all instances
      m{                 # use braces to delimit, so no escaping / (and | stays free for alternation)
        <tr              # beginning of row
        .*?              # minimal match of anything
        >(\d{5})         # > followed by 5 digits (remember digits)
        .*?              # minimal match of anything
        >(\d{2})         # > followed by 2 digits (remember digits)
        .*?              # minimal match of anything
        >(\d{3})         # > followed by 3 digits (remember digits)
        .*?              # minimal match of anything
        >(\d{3})         # > followed by 3 digits (remember digits)
        .*?              # minimal match of anything
        >(\d{2})         # > followed by 2 digits (remember digits)
        .*?              # minimal match of anything
        (&nbsp;|\w)      # &nbsp; or a letter
        </FONT>          # followed by a closing font tag
      }isxg              # case (i)nsensitive, treat as (s)ingle line,
                         # e(x)tended comments, match (g)lobally (all)
    ) {
        my @row = ($1,$2,$3,$4,$5,$6);
        # now do whatever with @row
    }

    # condensed
    while (m{<tr.*?>(\d{5}).*?>(\d{2}).*?>(\d{3}).*?>(\d{3}).*?>(\d{2}).*?(&nbsp;|\w)</FONT>}isg) {
        my @row = ($1,$2,$3,$4,$5,$6);
    }
    not tested, but I think it's OK :)

    cLive ;-)

      Thank you for the prompt response.
      First off, as I said, a lot about regexes escapes me, so reminding me that I can use (#) instead of repetition was good.
      Second, while I did remove the checks for the extra <font> tags, I forgot that I needn't check for the <td> tags either.
      Third, while I'd imagine that this works for the parts involved, I do not need either of the &nbsp;'s or the letter that may be in their place; it is the data in between them that is important.

      Thank you for all the help.

      -Etan
        No! # is a comment (the x modifier allows you to do this...)

        cLive ;-)
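
        To make the distinction concrete, here is a minimal sketch (the sample string is made up for illustration): under the x modifier, whitespace and anything after a # is ignored inside the pattern, while the repetition count is the number in curly braces.

```perl
use strict;
use warnings;

# Hypothetical one-cell sample, just to exercise the pattern.
my $text = '<td>354</td>';

# With /x the pattern may be spread out and commented.
my ($with_comments) = $text =~ m{
    >          # end of the opening tag
    (\d{3})    # exactly three digits, captured
    <          # start of the closing tag
}x;

# The same pattern written compactly, without /x.
my ($condensed) = $text =~ m{>(\d{3})<};

print "$with_comments $condensed\n";
```

        Both forms capture the same three digits; the first is only easier to read.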

Re: Regex optimization
by larsen (Parson) on May 08, 2001 at 13:29 UTC
    My two cents...

    Why not use HTML::TableExtract? (Uh, this is the second time I have pointed to this module in two days.) It will probably be a bit slower, but far more readable. And the code you write won't break if your assumptions about the format stop being valid.

    So far so good. But if you want to use a regexp, you could improve readability by writing e.g. \d{3} instead of \d\d\d.
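
    As a rough sketch of where that leads, here is one way the whole pattern might look with counted quantifiers and the x modifier. The field rules are taken from the question; the variable names ($row_re, @cells, @f) and the simplified cell attributes are my own assumptions, and it has only been checked against a row shaped like the sample.

```perl
use strict;
use warnings;

# Rebuild a row shaped like the sample (cell attributes simplified).
my @cells = ('29022', '01', '354', '201', '01', '&nbsp;',
             'INTRODUCTION TO FILM', ' 3', 'MW5M7,8', 'MI-100', '&nbsp;');
my $row = '<tr> '
        . join('', map { sprintf('<td align="center"><font face="Arial" size="2">%s</font> </td>', $_) } @cells)
        . '</tr>';

my $row_re = qr{
    <tr\b [^>]*>              .*?   # start of the row
    >(\d{5})<                 .*?   # field 1, five digits
    >(\d{2})<                 .*?   # field 2, two digits
    >(\d{3})<                 .*?   # field 3, three digits
    >(\d{3})<                 .*?   # field 4, three digits
    >(\d{2})<                 .*?   # field 5, two digits
    >(?:&nbsp;|[A-Za-z])<     .*?   # field 6, matched but not captured
    >([^<\s][^<]*)<           .*?   # field 7, title; must not start with whitespace
    >\s*(\d+)<                .*?   # field 8, leading space skipped
    >(\w[\w,]*)<              .*?   # field 9, letters, digits and commas
    >([A-Za-z]+-\w+)<               # field 10, letters, dash, then letters or digits
}xs;

my @f = $row =~ $row_re;
print join('|', @f), "\n" if @f;
```

    Anchoring each value between > and < keeps the lazy .*? from having to guess where a cell starts. The title field excludes a leading space so that it cannot match the gap between </font> and </td>.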

Re: Regex optimization
by iakobski (Pilgrim) on May 08, 2001 at 16:16 UTC
    For both readability and a little bit of speed try:
    @results = $str =~ m/>([^<\n]+)</g;
    Then validate the elements of the array if you need to. -- iakobski
      I believe I have misled you: the newlines in the HTML are in my representation only, not in the original. As I said, it is the most badly formatted HTML I've ever seen.

          -Etan
        Even easier:
        @results = $str =~ m/>([^<]+)</g;

        -- iakobski
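
        A small sketch of how the grab-everything approach might be followed by validation. The shortened row and the names @fields and $str are illustrative assumptions. Note that the space between </font> and </td> also sits between a > and a <, so it appears in the results and needs filtering.

```perl
use strict;
use warnings;

# Shortened stand-in for one row; the real rows have eleven cells.
my $str = '<tr> <td><font>29022</font> </td><td><font>01</font> </td>'
        . '<td><font>&nbsp;</font> </td>'
        . '<td><font>INTRODUCTION TO FILM</font> </td></tr>';

# List context plus /g returns the capture from every match.
my @results = $str =~ m/>([^<]+)</g;

# Drop whitespace-only gaps and the &nbsp; placeholders.
my @fields = grep { /\S/ && $_ ne '&nbsp;' } @results;

# Validate whichever positions matter, e.g. the five-digit first field.
die "unexpected first field: $fields[0]\n" unless $fields[0] =~ /\A\d{5}\z/;

print join('|', @fields), "\n";
```

        Filtering out the &nbsp; entries shifts the positions of later fields when a cell holds a letter instead, so in real use it may be safer to keep the placeholders and validate by index.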