PerlMonks  

Re: Multiple regex matches in single string

by hipowls (Curate)
on Apr 26, 2008 at 04:12 UTC (#683025)


in reply to Multiple regex matches in single string

You can use a negative lookahead to achieve the effect you are after:

    /(
        start           # match start
        (?:             # followed by
            .           # anything
            (?!start)   # not followed by start
        )+?             # but match as little as possible
        end             # until there is an end
    )/gisx
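A minimal sketch of how this pattern might be exercised. The sample data (one word per line, with several `start` lines before each block) and the sub name `extract_blocks` are assumptions made for illustration, not part of the original question:

```perl
use strict;
use warnings;

# Hypothetical sample data: redundant 'start' lines before each block.
my $string = join "\n", qw(start start start go one end
                           start start start go two end), '';

# Return every block matched by the lookahead pattern from the post.
sub extract_blocks {
    my ($text) = @_;
    my @blocks;
    while ( $text =~ /(
            start           # match start
            (?:
                .           # anything
                (?!start)   # not followed by start
            )+?             # as little as possible
            end             # until there is an end
        )/gisx ) {
        push @blocks, $1;
    }
    return @blocks;
}

print "$_\n\n" for extract_blocks($string);
```

With this data, each extracted block should begin at the last `start` before its `end`.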


Re^2: Multiple regex matches in single string
by Elijah (Hermit) on Apr 26, 2008 at 16:23 UTC
    The negative lookahead works great for matching the proper blocks of text. The problem it introduces is an extreme slowdown of the app: without the lookahead it runs in less than a second; with the lookahead, we are talking 10 minutes or so.

    My actual data file has thousands of lines of text, and much more text on each line than a single word. Even so, the version without the negative lookahead runs against these large files in less than a second. Why would the lookahead cause such a slowdown? Is there a way around this?
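One way to put numbers on the difference is a small timing harness. This is only a sketch: the synthetic data generator `make_data`, the helper `time_pattern`, and the block count are all assumptions, and data this simple may well not reproduce the slowdown seen on the real file.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical synthetic input: $n blocks, each preceded by extra 'start' lines.
sub make_data {
    my ($n) = @_;
    return join '', map { "start\nstart\nstart\ngo\n$_\nend\n" } 1 .. $n;
}

# Count the matches of a compiled pattern and report the wall-clock seconds taken.
sub time_pattern {
    my ( $rx, $text ) = @_;
    my $t0    = time();
    my $count = () = $text =~ /$rx/g;    # one capture per match
    return ( $count, time() - $t0 );
}

my $plain     = qr/(start.+?end)/is;                 # no lookahead
my $lookahead = qr/(start(?:.(?!start))+?end)/is;    # lookahead after every character

my $data = make_data(1000);
for my $case ( [ plain => $plain ], [ lookahead => $lookahead ] ) {
    my ( $name, $rx ) = @$case;
    my ( $count, $secs ) = time_pattern( $rx, $data );
    printf "%-10s %5d matches in %.4f s\n", $name, $count, $secs;
}
```

Swapping `make_data` for a read of the real file would show whether the slowdown comes from the pattern itself or from something particular to the data.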

      If I am reading hipowls's regex correctly, it checks the negative look-ahead after every character between 'start' and 'end'. Doing the look-ahead just once should locate the last 'start' in a group; the .+? can then run without re-checking after every character.

      use strict;
      use warnings;

      my $string = <<'EOT';
      start
      start
      start
      go
      one
      end
      start
      start
      start
      go
      two
      end
      EOT

      my $rxGroup = qr{(?isx)
          (
              start
              (?!\nstart)
              .+?
              end
          )
      };

      print qq{$1\n\n} while $string =~ m{$rxGroup}g;

      The output.

      start
      go
      one
      end

      start
      go
      two
      end

      I hope I am correct and this slight change will speed up your code.

      Cheers,

      JohnGG

        That assumes that the starts are on consecutive lines, which may be a perfectly valid assumption. It pays to know your data.

        Another approach is to use the original regex, whose match may contain multiple starts, and then trim the match using s/^.*start/start/is.

        The loop then looks something like

        while ( $string =~ /(start.+?end)/gis ) {
            my $data = $1;
            $data =~ s/^.*start/start/is;
            print $data, "\n\n";
        }

        If the intent is to strip off multiple starts only on consecutive lines, then the regex would be s/^(?:start\s*)+start/start/is, which used on the input

            start
            start
            start
            go
            one
            end
            start
            start
            data
            start
            go
            two
            end

        would produce

            start
            go
            one
            end

            start
            data
            start
            go
            two
            end

        But as I said, you really need to know your data, as well as other factors such as whether you need to validate the input.
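The two trimming strategies can be compared side by side. This is a sketch under assumptions: the sub names `trim_to_last_start` and `trim_leading_starts` are invented here, and the input is taken to be one word per line as in the example.

```perl
use strict;
use warnings;

# Keep only the text from the last 'start' onward.
sub trim_to_last_start {
    my ($block) = @_;
    $block =~ s/^.*start/start/is;
    return $block;
}

# Collapse only a leading run of consecutive 'start' lines into one 'start'.
sub trim_leading_starts {
    my ($block) = @_;
    $block =~ s/^(?:start\s*)+start/start/is;
    return $block;
}

# Hypothetical input, one word per line, mirroring the example in the post.
my $string = join "\n", qw(start start start go one end
                           start start data start go two end), '';

# Match loosely first, then trim each match.
while ( $string =~ /(start.+?end)/gis ) {
    my $block = $1;
    print trim_leading_starts($block), "\n\n";
}
```

Calling `trim_to_last_start` in the loop instead would also drop the `start` / `data` lines from the second block, so the choice depends on whether a `start` in the middle of a block is data or noise.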
