Re: Unrolling the loop technique

in reply to Unrolling the loop technique

OK, I'll try my best at explaining the "unrolling the loop" mechanism (I don't have MRE at hand, so feel free to correct me if I am blatantly wrong!):

I use this technique in 2 cases:

When I want to match a string between 2 1-char delimiters, but the end delimiter can be included in the string through an escape mechanism. For example a double-quoted string that can include the double-quote if backslashed. In this case the "naive" m/"[^"]*"/ would not work.
When I want to match a string between 2 delimiters but the end delimiter is several characters long, such as in C comments (/*comment*/). In this case /*(.*?)*/ would work but it might be slow (I think, I have not benchmarked it).

In the first case here is how you want to match:

the start delimiter,
then an "easy" string (the normal* part): anything that does not include the escape character,
if you find the escape character then you want to skip the next character (the special part), whatever it is, and then match the following "easy" string (the second normal*), until you get to the next escape character or the end delimiter,
finally you match the end delimiter

Now if you want to match a string with a multi-character end delimiter here is how to do it:

match the start delimiter,
match anything until you get to the first character of the end delimiter,
if that character is not followed by the rest of the end delimiter then you can keep on matching "regular" characters until the next time you find that character,
match the end delimiter

A potential pitfall is that you want to make sure you don't consume the characters just after the first character of the end delimiter, or things like **/ (the first character of the end delimiter is there twice in a row, once as a regular character and once as the start of the end delimiter) would not be processed properly.

I guess a couple of examples might be appropriate.

First matching a double-quoted string, double quotes can be escaped using \":

#!/bin/perl -w
use strict;

while( <DATA>)
  { next if(/^\s*#/); # skip comments in DATA
    chomp;
    # split data into string to match and expected result(s)
    my( $string, @expected)= split /\s*=>\s*/;
    while( $string=~ m{"                              # the start deli
+miter
                        ([^\\"]*                      # anything but t
+he end of the string or the escape char
                                (?:\\.                #     the escape
+ char preceeding an escaped char (any char)
                                      [^\\"]*         #     anything b
+ut the end of the string or the escape char
                                             )*)      #     repeat
                                                "}gx) # the end delimi
+ter
      { my $match= $1;
        my $expected= shift @expected;
        unless( $match eq $expected)
          { print "unexpected result line $.: found /$match/, expectin
+g /$expected/\n"; }
      }
  }

__DATA__
# string to match                 => expected results(s)
toto
"a string" tata                   => a string
toto "a string"                   => a string
toto "a string" tata              => a string
toto "a \" string" tata           => a \" string
toto "\" string" tata             => \" string
toto "a\"" tata                   => a\"
toto "\"" tata                    => \"
toto "\"\"" tata                  => \"\"
toto "string 1" "string 2" tata   => string 1 => string 2
toto "string 1                    =>
toto "string 1" "string           => string 1
toto "tata\\" tutu                => tata\\
toto "tata\\\"" tutu              => tata\\\"
[download]

And now how to match C-like comments:

#!/bin/perl -w
use strict;

while( <DATA>)
  { chomp;
    next if(^\s*#); # skip comments
    # split the data into the string to match and the expected result(
+s)
    my( $string, @expected)= split /\s*=>\s*/;
    while( $string=~ m{/\*                              # the delimite
+r
                        ([^*]*                          # anything but
+ the beginning of the delimiter
                                (?:\*(?!>/)             #   the beginn
+ing of the delimiter, not preceeding the rest of the delimiter
                                                        #      (?!>/) 
+means "not before /, do not use the next char)
                                      [^/]*             #   anything b
+ut the beginning of the delimiter
                                             )*)        #   repeat
                                                \*/}gx) # the end of t
+he delimiter
      { my $match= $1;
        my $expected= shift @expected || '';
        unless( $match eq $expected)
          { print "unexpected result line $.: found /$match/, expectin
+g /$expected/\n"; }
      }
  }

__DATA__
# string                  => result(s)
toto                      =>
toto /*foo*/ tata         => foo
/*foo*/ tata              => foo
toto /*foo*/              => foo
toto /*foo*bar*/          => foo*bar
toto /*foo**/             => foo*
toto /**/                 =>
/***/                     => *
/*/*/                     => /
[download]

Comment on Re: Unrolling the loop technique Select or Download Code

Replies are listed 'Best First'.
(tye)Re: Unrolling the loop technique by tye (Sage) on Jun 19, 2001 at 23:49 UTC
In this case `/(.?)/` would work but it might be slow (I think, I have not benchmarked it).* I would strongly prefer `m#/\(.?)\/#s` over the unrolled loop version if that is the whole regex (and I'd try to make that the whole regex precisely because I could then avoid using the unrolled regex). The real problem with this simple technique comes when you try to use it as part of a larger regex. For example, let's say you want to extract "comment blocks", that is, a C-style comment that starts at the beginning of a line and ends at the end of a line. Using `m#^/\(.?)\/$#msg` sure seems an easy way, and it even works for a lot of cases. However, consider this unlikely sample input: `/* This is correctly matched / / This: / runcode(); / gets included in the "comment" /` [download] which would return this list: `( " This is correctly matched ", ' This: / runcode(); /* gets included in the "comment" ' )` [download] You see that `.?` matches as little as possible but will prefer to match more if matching more will allow the entire* regex to match (or to match earlier) when less causes the regex as a whole to fail (or to match later). If I find myself wanting to use the loop unrolling technique, then I usually try to rework the problem by parsing in smaller chunks. Though, if these chunks start getting really small (like my parser starts having to deal with single characters in lots of cases), then I may use some of the simplest examples of unrolled regex loops. - tye (but my friends call me "Tye")	[reply] [d/l] [select]

Replies are listed 'Best First'.

(tye)Re: Unrolling the loop technique
by tye (Sage) on Jun 19, 2001 at 23:49 UTC

In this case /*(.*?)*/ would work but it might be slow (I think, I have not benchmarked it).

I would strongly prefer m#/\*(.*?)\*/#s over the unrolled loop version if that is the whole regex (and I'd try to make that the whole regex precisely because I could then avoid using the unrolled regex).

The real problem with this simple technique comes when you try to use it as part of a larger regex. For example, let's say you want to extract "comment blocks", that is, a C-style comment that starts at the beginning of a line and ends at the end of a line. Using m#^/\*(.*?)\*/$#msg sure seems an easy way, and it even works for a lot of cases. However, consider this unlikely sample input:

/* This is correctly matched */
/* This: */ runcode(); /* gets included in the "comment" */
[download]

( " This is correctly matched ",
  ' This: */ runcode(); /* gets included in the "comment" ' )
[download]

.*?

entire

If I find myself wanting to use the loop unrolling technique, then I usually try to rework the problem by parsing in smaller chunks. Though, if these chunks start getting really small (like my parser starts having to deal with single characters in lots of cases), then I may use some of the simplest examples of unrolled regex loops.

tye

[reply]
[d/l]
[select]

In Section Seekers of Perl Wisdom