PerlMonks  

Multiple regex matches in single string

by Elijah (Hermit)
on Apr 26, 2008 at 02:55 UTC

Elijah has asked for the wisdom of the Perl Monks concerning the following question:

Got an issue where I have repeating patterns in a single string and I need a way to match on all of them. Using a while loop and the global modifier does the trick but I have one peculiarity.

The test code I have is as follows:

    #!/usr/bin/perl -w
    use strict;

    my $string = <<END;
    start
    start
    start
    go
    one
    end
    start
    start
    start
    go
    two
    end
    END

    while ($string =~ /(start.+?end)/gis) {
        print $1,"\n\n";
    }

This will print out something like this:

    start
    start
    start
    go
    one
    end

    start
    start
    start
    go
    two
    end

This, of course, is not what I want. I would like it to only print out:

    start
    go
    one
    end

    start
    go
    two
    end

I need somehow to ignore the preceding starts. I have tried using (?:.*) but this only prints out the second group for some reason. Any ideas?

Replies are listed 'Best First'.
Re: Multiple regex matches in single string
by hipowls (Curate) on Apr 26, 2008 at 04:12 UTC

    You can use a negative lookahead to achieve the effect you are after:

        /(
            start           # match start
            (?:             # followed by
                .           # anything
                (?!start)   # not followed by start
            )+?             # but match as little as possible
            end             # until there is an end
        )/gisx

      The negative lookahead works great for matching the proper blocks of text. The problem it introduces is an extreme slowdown of the app: without the lookahead it runs in less than a second; with the lookahead, we are talking 10 minutes or so.

      My actual data file has thousands of lines of text, and much more text on each line than a single word. Even so, the version without the negative lookahead gets through these large files in less than a second. Why would the lookahead cause such a slowdown? Is there a way around it?

        If I am reading hipowls's regex correctly, it checks the negative lookahead for every character between 'start' and 'end'. Doing the lookahead just once should locate the last 'start' in a group; the .+? can then run without re-checking after every character.

            use strict;
            use warnings;

            my $string = <<'EOT';
            start
            start
            start
            go
            one
            end
            start
            start
            start
            go
            two
            end
            EOT

            my $rxGroup = qr
                {(?isx)
                    (
                        start
                        (?!\nstart)
                        .+?
                        end
                    )
                };

            print qq{$1\n\n} while $string =~ m{$rxGroup}g;

        The output.

            start
            go
            one
            end

            start
            go
            two
            end

        I hope I am correct and this slight change will speed up your code.

        Cheers,

        JohnGG

Re: Multiple regex matches in single string
by pc88mxer (Vicar) on Apr 26, 2008 at 04:33 UTC
    Non-greedy matching (i.e. .+?) doesn't do what you want here because the regex is matched forward through the string: the match begins at the first start it finds, and the non-greedy modifier only makes it stop at the first end encountered after that point. It does nothing to move the beginning of the match to a later start. You would get what you wanted if you reversed both the string and the regex. Of course, you would then run into the same problem if there were several end lines in a row.
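
    A minimal sketch of the reversal idea described above. This is an illustration under stated assumptions, not pc88mxer's own code: the helper name blocks_via_reverse is hypothetical, and the sample data assumes each word sits on its own line, as in the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: returns each innermost start..end block,
# in the order the blocks appear in the original string.
sub blocks_via_reverse {
    my ($string) = @_;

    # In scalar context, reverse() reverses the characters of the string.
    my $reversed = reverse $string;

    my @blocks;
    # "end" reversed is "dne" and "start" reversed is "trats", so the
    # non-greedy match now stops at the start closest to each end.
    while ( $reversed =~ /(dne.+?trats)/gis ) {
        # Matches come out last block first; unshift restores the order.
        unshift @blocks, scalar reverse $1;
    }
    return @blocks;
}

# Sample data assumed from the question: one word per line.
my $string = join "\n",
    qw(start start start go one end start start start go two end), '';

print "$_\n\n" for blocks_via_reverse($string);
```

    As the reply notes, the same weakness reappears in mirror image: several consecutive end lines would all be pulled into one block.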

    I'm not sure if this is just a toy example or not, but if this is part of a real project you might consider processing the string one line at a time using a state-machine pattern. That will allow you to parse it more robustly, e.g. find unmatched start and end lines, handle nested start-end blocks, print more meaningful error messages, etc.
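
    One way the line-at-a-time state machine might look, as a rough sketch only. The sub name extract_blocks and the warning texts are hypothetical, and it assumes start and end each appear alone on a line. It keeps only the last start before each end and does not attempt nested blocks.

```perl
use strict;
use warnings;

# Hypothetical helper: walks the text line by line and returns every
# block that runs from the last "start" line to the next "end" line.
sub extract_blocks {
    my ($text) = @_;

    my @blocks;         # completed blocks
    my @current;        # lines of the block being collected
    my $in_block = 0;   # state: 0 = outside a block, 1 = inside one

    for my $line ( split /\n/, $text ) {
        if ( $line =~ /^start$/i ) {
            # A new start discards anything collected since the previous one.
            @current  = ($line);
            $in_block = 1;
        }
        elsif ( $line =~ /^end$/i ) {
            if ($in_block) {
                push @blocks, join "\n", @current, $line;
            }
            else {
                warn "end without a matching start\n";
            }
            @current  = ();
            $in_block = 0;
        }
        elsif ($in_block) {
            push @current, $line;
        }
    }

    warn "start without a matching end\n" if $in_block;
    return @blocks;
}

# Sample data assumed from the question: one word per line.
my $string = join "\n",
    qw(start start start go one end start start start go two end), '';

print "$_\n\n" for extract_blocks($string);
```

    Because every line passes through a single explicit state variable, unmatched start or end lines can be reported where they occur instead of silently distorting a regex match.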

Re: Multiple regex matches in single string
by jethro (Monsignor) on Apr 26, 2008 at 04:11 UTC
        my @parts = split /start/, $string;
        shift @parts;
        foreach (@parts) {
            print "start$1\n\n" if (m{^ (.*? end ) }xms);
        }

Re: Multiple regex matches in single string
by hipowls (Curate) on Apr 27, 2008 at 03:37 UTC

    Given that your data file has thousands of lines, you may be better off choosing a different approach. I assume that end is always lower case. Set the input record separator with $/ = "\nend\n". Each read from the file then returns one paragraph terminated by an end on a line by itself. The advantage is that you don't need to read the whole file into memory, and it should scale better.

        local $/ = "\nend\n";

        while ( my $data = <DATA> ) {
            $data =~ s/^.*start/start/is;
            print $data;
            print '=' x 10, "\n";
        }

        __DATA__
        start
        start
        start
        go
        one
        end
        start
        start
        start
        go
        two
        end

    produces

        start
        go
        one
        end
        ==========
        start
        go
        two
        end
        ==========

    If my assumption of a lower case end is incorrect then you can achieve the same effect with

        my $data = '';
        while ( my $line = <DATA> ) {
            $data .= $line;
            if ( $line =~ /^end$/i ) {
                $data =~ s/^.*start/start/is;
                print $data;
                print '=' x 10, "\n";
                $data = '';
            }
        }
