Multiple regex matches in single string

Elijah has asked for the wisdom of the Perl Monks concerning the following question:

Got an issue where I have repeating patterns in a single string and I need a way to match on all of them. Using a while loop and the global modifier does the trick but I have one peculiarity.

The test code I have is as follows:

#!/usr/bin/perl -w

use strict;

my $string = <<END;
start
start
start
go
one
end
start
start
start
go
two
end
END

while ($string =~ /(start.+?end)/gis) {
    print $1,"\n\n";
}

This will print out something like this:

start
start
start
go
one
end

start
start
start
go
two
end

This, of course, is not what I want. I would like it to only print out:

start
go
one
end

start
go
two
end

I need somehow to ignore the preceding starts. I have tried using (?:.*) but this only prints out the second group for some reason. Any ideas?


Replies are listed 'Best First'.
Re: Multiple regex matches in single string by hipowls (Curate) on Apr 26, 2008 at 04:12 UTC
You can use a negative lookahead to achieve the effect you are after:

/(
    start           # match start
    (?:             # followed by
        .           # anything
        (?!start)   # not followed by start
    )+?             # but match as little as possible
    end             # until there is an end
)/gisx
Re^2: Multiple regex matches in single string by Elijah (Hermit) on Apr 26, 2008 at 16:23 UTC
The negative lookahead works great for matching the proper blocks of text. The problem it introduces is an extreme slowdown of the app. Without the lookahead it runs in less than a second; with the lookahead, we are talking 10 minutes or so. My actual data file has thousands of lines of text, and much more text on each line than a single word, but even so the version without the negative lookahead gets through these large files in less than a second. Why would this cause such a slowdown? Is there a way around this?
Re^3: Multiple regex matches in single string by johngg (Canon) on Apr 26, 2008 at 22:22 UTC
If I am reading hipowls's regex correctly, it will be checking the negative look-ahead for every character between 'start' and 'end'. Doing the look-ahead just once should locate the last 'start' in a group; then the `.+?` can run without checking again after every character.

use strict;
use warnings;

my $string = <<'EOT';
start
start
start
go
one
end
start
start
start
go
two
end
EOT

my $rxGroup = qr
    {(?isx)
        (
            start
            (?!\nstart)
            .+?
            end
        )
    };

print qq{$1\n\n} while $string =~ m{$rxGroup}g;

The output:

start
go
one
end

start
go
two
end

I hope I am correct and this slight change will speed up your code.

Cheers,
JohnGG
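A rough way to check whether the single look-ahead actually helps is to time both patterns on the same input. The sketch below is not from the thread. It uses the core Benchmark module, and the helper `blocks_with`, the variable names, and the idea of repeating the sample 500 times to stand in for a larger file are all assumptions made for illustration. Timings on real data may look quite different.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $sample = <<'EOT';
start
start
start
go
one
end
start
start
start
go
two
end
EOT

# Assumption: repeating the toy sample stands in for a larger data file.
my $big = $sample x 500;

# hipowls's pattern: the look-ahead is checked after every character.
my $rx_per_char = qr{(start(?:.(?!start))+?end)}is;

# johngg's pattern: the look-ahead is checked once, right after "start".
my $rx_once = qr{(start(?!\nstart).+?end)}is;

# Hypothetical helper: return every block the pattern captures in $text.
sub blocks_with {
    my ($rx, $text) = @_;
    my @found;
    push @found, $1 while $text =~ /$rx/g;
    return @found;
}

# Sanity check first: both patterns should find the same blocks.
my @per_char = blocks_with($rx_per_char, $big);
my @once     = blocks_with($rx_once,     $big);
print scalar(@per_char), " vs ", scalar(@once), " blocks found\n";

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    per_char => sub { my @b = blocks_with($rx_per_char, $big) },
    once     => sub { my @b = blocks_with($rx_once,     $big) },
});
```

Swapping `$big` for the contents of the real file would give a more meaningful comparison than the repeated toy sample.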
Re^4: Multiple regex matches in single string by hipowls (Curate) on Apr 26, 2008 at 23:33 UTC
Re: Multiple regex matches in single string by pc88mxer (Vicar) on Apr 26, 2008 at 04:33 UTC
The reason that non-greedy matching (i.e. `.+?`) doesn't work here is that the regex is matched forward through the string: the match begins at the first `start` it finds, and the non-greedy modifier only makes it stop at the first `end` encountered after that. You would get what you wanted if you reversed both the string and the regex. Of course, you would then run into the same problem if there were multiple `end` lines together.

I'm not sure if this is just a toy example or not, but if this is part of a real project you might consider processing the string one line at a time using a state-machine pattern. That will allow you to parse it more robustly, e.g. find unmatched `start` and `end` lines, handle nested `start`-`end` blocks, print more meaningful error messages, etc.
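The line-at-a-time approach could look something like the sketch below. It is only an illustration, not code from the thread: the sub name `extract_blocks` is made up, and it assumes that `start` and `end` each sit alone on their own line, which may not hold for the real data. It does not handle nested blocks; a later `start` simply replaces an earlier one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: walk the text line by line and collect start..end
# blocks, keeping only the last "start" of any run of starts.
# Assumes each marker is a whole line by itself.
sub extract_blocks {
    my ($text) = @_;
    my (@blocks, @problems);
    my $current;        # array ref while inside a block, undef outside
    my $line_no = 0;

    for my $line (split /\n/, $text) {
        $line_no++;
        if ($line =~ /^start$/i) {
            # A new "start" throws away anything collected so far, so
            # consecutive starts collapse to the last one.
            $current = [$line];
        }
        elsif ($line =~ /^end$/i) {
            if ($current) {
                push @blocks, join("\n", @$current, $line);
                undef $current;
            }
            else {
                push @problems, "line $line_no: 'end' without a matching 'start'";
            }
        }
        elsif ($current) {
            push @$current, $line;    # ordinary line inside a block
        }
    }
    push @problems, "input ended inside an unterminated 'start' block"
        if $current;

    return (\@blocks, \@problems);
}

my $string = <<'END';
start
start
start
go
one
end
start
start
start
go
two
end
END

my ($blocks, $problems) = extract_blocks($string);
print "$_\n\n" for @$blocks;
warn  "$_\n"   for @$problems;
```

Because the text is scanned once with cheap anchored matches per line, there is no backtracking across the whole string, and unmatched markers are reported instead of being silently skipped.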
Re: Multiple regex matches in single string by jethro (Monsignor) on Apr 26, 2008 at 04:11 UTC
my @parts = split /start/, $string;
shift @parts;
foreach (@parts) {
    print "start$1\n\n" if (m{^ (.*? end ) }xms);
}
Re: Multiple regex matches in single string by hipowls (Curate) on Apr 27, 2008 at 03:37 UTC
Given that your data file has thousands of lines, you may be better off choosing a different approach. I assume that `end` is always lower case. Set the input record separator to `$/ = "\nend\n"`. Now each read of the file reads in a paragraph terminated by an `end` by itself on a line. The advantage is that you don't need to read the whole file into memory, and it should scale better.

local $/ = "\nend\n";

while ( my $data = <DATA> ) {
    $data =~ s/^.*start/start/is;
    print $data;
    print '=' x 10, "\n";
}

__DATA__
start
start
start
go
one
end
start
start
start
go
two
end

produces

start
go
one
end
==========
start
go
two
end
==========

If my assumption of a lower case `end` is incorrect then you can achieve the same effect with

my $data = '';
while ( my $line = <DATA> ) {
    $data .= $line;
    if ( $line =~ /^end$/i ) {
        $data =~ s/^.*start/start/is;
        print $data;
        print '=' x 10, "\n";
        $data = '';
    }
}

