raygun has asked for the wisdom of the Perl Monks concerning the following question:

I am performing a substitution on strings that fall between two anchors. A certain substring -- say, "cd" -- may or may not appear as part of the string I'm matching. If it does, I need to capture it.

In my examples below, the anchors are commas, but in reality they are complex regular expressions, so that I can't just use [^,]* to avoid steamrolling over them.

In essence, the substitution will be some variation on

s/,.*(cd)?.*,/=$1=/

Making the two .* subexpressions nongreedy while keeping the (cd)? greedy would seem to express exactly what I'm trying to do, except for the slight inconvenience that it doesn't work:

echo ,abcdefg,abcdefg | perl -pe 's/,.*?(cd)?.*?,/=$1=/'

I want a captured "cd" between the two equal signs, but the $1 remains empty. The problem seems to be that .*? is apparently not merely nongreedy but also unaccommodating: It won't even consume enough to allow the following greedy subexpression to match. It's not clear to me why a nongreedy expression would consume enough to match a required subexpression (i.e. if I omitted the ? after (cd)), but not an optional but greedy one.
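The behavior described above can be reproduced in a small script. This is only a minimal sketch, and the variable names are illustrative rather than taken from the original post. The regex engine accepts the first overall match it finds. A lazy .*? starts by matching nothing, and at that point (cd)? is already satisfied by matching zero times, so nothing forces the engine to advance to the "cd". When (cd) is required, the lazy quantifier has to expand until the group can match.

```perl
use strict;
use warnings;

my $s = ',abcdefg,abcdefg';

# Optional group: with .*? matching nothing, (cd)? succeeds by matching
# zero times, so the engine never needs to advance to the "cd".
my ($optional) = $s =~ /,.*?(cd)?.*?,/;

# Required group: the match cannot succeed until the first .*? has
# expanded far enough for (cd) to match.
my ($required) = $s =~ /,.*?(cd).*?,/;

printf "optional group: %s\n", defined $optional ? "'$optional'" : 'undef';
printf "required group: %s\n", defined $required ? "'$required'" : 'undef';
```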

I can make the expression capture the "cd" if I make my first subexpression a little more explicit:

echo ,abcdefg,abcdefg, | perl -pe 's/,(?:(?!cd).)*(cd)?.*?,/=$1=/'

This gives me the desired output of "=cd=abcdefg,". But this fails if the part between the anchors does not contain a "cd":

echo ,abcefg,abcdefg, | perl -pe 's/,(?:(?!cd).)*(cd)?.*?,/=$1=/'

Here, the desired output is "==abcdefg,", but the greedy subexpression ignores the anchor boundary and goes into the section of the string following it to find a "cd".
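To see how far the pattern runs in this failing case, one can print the span that actually matched. The following is a diagnostic sketch only; the variable names are illustrative, and it uses the built-in @- and @+ offset arrays to extract the matched text.

```perl
use strict;
use warnings;

my $s = ',abcefg,abcdefg,';
my $matched;

if ($s =~ /,(?:(?!cd).)*(cd)?.*?,/) {
    # $-[0] and $+[0] are the start and end offsets of the whole match.
    $matched = substr $s, $-[0], $+[0] - $-[0];
    print "matched: '$matched'\n";
    print "captured: '$1'\n" if defined $1;
}
```

Nothing in (?:(?!cd).)* refers to the closing anchor, so it is free to consume the comma while looking for a "cd".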

I've tried various other things but not yet found something that works. How do I get the $1 to be populated with a "cd" if it appears in the string, and remain empty if it doesn't, while staying between the anchors?

Replies are listed 'Best First'.
Re: greedy subexpression between two nongreedy ones
by AnomalousMonk (Bishop) on Jun 02, 2015 at 05:40 UTC

    As a general technique, here's a way to emulate [^set] when you are dealing with a potentially complex regex rather than a simple character class:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my $cd = qr{ cd }xms;
     my $not_cd = qr{ (?! $cd) . }xms;
     ;;
     for my $s (',abcdefg,pqrstuv', ',abefg,pqrstuv', @ARGV) {
       my $t = $s;
       print qq{'$t'};
       $t =~ s{ , $not_cd* ($cd?) [^,]* , }{=$1=}xms;
       print qq{'$t' \n};
     }
    "
    ',abcdefg,pqrstuv'
    '=cd=pqrstuv'

    ',abefg,pqrstuv'
    '==pqrstuv'

    Where:
      $cd is an arbitrary regex pattern;
      $not_cd matches any character that does not begin the arbitrary pattern.

    Update: Here's an example more in tune with your OPed assertion that the start and end delimiter patterns (and the optional included pattern) may be complex:

    c:\@Work\Perl>perl -wMstrict -le
    "my $maybe = qr{ cd? }xms;
     ;;
     my $start = qr{ A | BC | DE?F }xms;
     my $end = qr{ U | VW | XY?Z }xms;
     my $excluded = qr{ (?! $end | $maybe) . }xms;
     ;;
     for my $s ('AxxcdxxVWABCDEFUVWXYZ', 'BCxxxUABCDEFUVWXYZ', @ARGV) {
       my $t = $s;
       print qq{'$t'};
       $t =~ s{ $start $excluded* ($maybe?) .*? $end }{=$1=}xms;
       print qq{'$t' \n};
     }
    " BCxxcxxXZABCDEFUVWXYZ
    'AxxcdxxVWABCDEFUVWXYZ'
    '=cd=ABCDEFUVWXYZ'

    'BCxxxUABCDEFUVWXYZ'
    '==ABCDEFUVWXYZ'

    'BCxxcxxXZABCDEFUVWXYZ'
    '=c=ABCDEFUVWXYZ'

    The  .*? $end could be  $excluded* $end instead. This would make the regex perhaps a bit more robust, but a bit slower.
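The variant mentioned above can be sketched as a standalone script. This is an illustration under the same definitions as the one-liner, not a tested replacement for it; the @results array is a hypothetical addition for collecting the output.

```perl
use strict;
use warnings;

my $maybe    = qr{ cd? }xms;
my $start    = qr{ A | BC | DE?F }xms;
my $end      = qr{ U | VW | XY?Z }xms;
my $excluded = qr{ (?! $end | $maybe) . }xms;

my @results;
for my $s ('AxxcdxxVWABCDEFUVWXYZ', 'BCxxxUABCDEFUVWXYZ', 'BCxxcxxXZABCDEFUVWXYZ') {
    # Same substitution as above, with $excluded* in place of the lazy .*?
    (my $t = $s) =~ s{ $start $excluded* ($maybe?) $excluded* $end }{=$1=}xms;
    push @results, $t;
    print "'$s' -> '$t'\n";
}
```

One difference worth noting: the trailing $excluded* cannot step over a second occurrence of $maybe, so an input with two such occurrences before $end may fail to match where the .*? form would succeed.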


    Give a man a fish:  <%-(-(-(-<

Re: greedy subexpression between two nongreedy ones
by Anonymous Monk on Jun 02, 2015 at 04:56 UTC

      Although this may be perfectly acceptable, $1 will be undefined if "cd" is not present in the string being searched:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "for my $s (',abcdefg,pqrstuv', ',abefg,pqrstuv', @ARGV) {
         my $t = $s;
         print qq{'$t'};
         $t =~ s/,(?:.*?(cd))?.*?,/=$1=/;
         print qq{'$t' \n};
       }
      "
      ',abcdefg,pqrstuv'
      '=cd=pqrstuv'

      ',abefg,pqrstuv'
      Use of uninitialized value $1 in concatenation (.) or string at -e line 1.
      '==pqrstuv'
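Whether $1 ends up undefined or merely empty depends on where the quantifier sits relative to the capturing parentheses. The sketch below, with illustrative variable names, contrasts the two placements; it may also explain why the ($cd?) form in the earlier reply runs without a warning.

```perl
use strict;
use warnings;

my $s = ',abefg,pqrstuv';

# Quantifier outside the group: the group is skipped entirely, so the
# capture is undef.
my ($outside) = $s =~ /,(?:(?!cd).)*(cd)?/;

# Quantifier inside the group: the group takes part in the match and
# captures the empty string.
my ($inside) = $s =~ /,(?:(?!cd).)*((?:cd)?)/;

printf "outside: %s\n", defined $outside ? "'$outside'" : 'undef';
printf "inside:  %s\n", defined $inside  ? "'$inside'"  : 'undef';
```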



        s/,(?:.*(cd))?.*,/$1 ? "=$1=" : "=="/e
Re: greedy subexpression between two nongreedy ones
by Anonymous Monk on Jun 03, 2015 at 16:52 UTC
    #!/usr/bin/perl

    # http://perlmonks.org/?node_id=1128703

    use Data::Dump qw(pp);
    use strict;
    use warnings;

    pp [ map           # explanation of regex

      s/
      ,                # find a comma
      .*?              # match as little as possible until you find
      (?:              # grouping for either
        (cd)           # cd :)
        .*?            # which must be followed by as little as possible up to the next
        ,              # comma
      |                # or
        ,              # a comma
      )                # end either grouping
      /                # replace with
      $1               # test for true to avoid "uninitialized" warning
        ? "=$1="       # =cd=
        : "=="         # no cd because comma was found first
      /exr             # eval, expanded, and return result of replacement (just for test)

      # rest of running code

      , ',abcefg,abcdefg,', ',abcdefg,abcdefg,' ];

    __END__
Re: greedy subexpression between two nongreedy ones
by Anonymous Monk on Jun 02, 2015 at 21:38 UTC

    Have you considered testing for 'cd' in the replacement?

    s/,(.*?),/=@{[$1 =~ m|(cd)| && $1 ]}=/

      I had not -- because I had no idea such a thing was possible. (So many Perl tricks to learn!) In fact, I still don't know how to find out more about this, because it's impossible to search Perl docs for a "@" and home in on a specific meaning. :-) What's the name of that @{ } construct?

      This solves my problem perfectly, and has the bonus of not further cluttering the left-hand side of an already complex substitute pattern. Thanks for the new trick!

        @{[...]} is a trick for interpolation of almost any expression into a string, explained e.g. here.

        What's the name of that @{ } construct?

        According to perlsecret @{[ ]} is the "baby cart" operator.
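For readers who have not met the construct before, here is a short sketch of what it does, first in an ordinary string and then in the substitution from the reply above. The variable names are illustrative.

```perl
use strict;
use warnings;

my @words = qw(greedy lazy);

# A function call is not interpolated in a string; only the array is.
my $plain = "count: scalar(@words)";

# [ ... ] builds an anonymous array from the expression and @{ ... }
# dereferences it, so the value of the expression is interpolated.
my $cart = "count: @{[ scalar @words ]}";

# The same construct in the replacement half of s///.
(my $with_cd    = ',abcdefg,abcdefg,') =~ s/,(.*?),/=@{[ $1 =~ m|(cd)| && $1 ]}=/;
(my $without_cd = ',abcefg,abcdefg,')  =~ s/,(.*?),/=@{[ $1 =~ m|(cd)| && $1 ]}=/;

print "$plain\n$cart\n$with_cd\n$without_cd\n";
```

In the substitution, a successful inner match resets $1 to "cd", which is the value that && then returns; a failed inner match yields an empty string, so nothing appears between the equal signs.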

Re: greedy subexpression between two nongreedy ones
by Anonymous Monk on Jun 02, 2015 at 22:07 UTC

    Indeed it seems you are trying to accomplish too much with a single regex! But here's another alternative:

    s/(?&COMMA)(?:.*?(?:((?&CEEDEE)).*?(?&COMMA)|(?&COMMA)()))(?(DEFINE)(?<COMMA>,)(?<CEEDEE>cd))/=$+=/;

Re: greedy subexpression between two nongreedy ones
by Anonymous Monk on Jun 03, 2015 at 01:38 UTC

    Just trying to see how many '?' I can get in one regex :)

    s/,.*?(?:(cd).*?)??,/=$1=/

    This will, of course, give "Use of uninitialized value $1 in concatenation (.)" if you use warnings. Here's a fix for that, if needed.

    s;,.*?(?:(cd).*?)??,;=@{[$1 // '']}=;

    or

    s/,.*?(?:(cd).*?)??,/$1 ? "=$1=" : "=="/e # without that @{[]} trick
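A small harness for checking the two warning-free forms against the strings from the original question might look like this. It is a sketch only: the %got hash is a hypothetical addition, and braces are used as delimiters for the first substitution so that the // operator needs no escaping.

```perl
use strict;
use warnings;

my %got;
for my $s (',abcdefg,abcdefg,', ',abcefg,abcdefg,') {
    # Defined-or supplies an empty string when the group did not match.
    (my $cart = $s) =~ s{,.*?(?:(cd).*?)??,}{=@{[ $1 // '' ]}=};

    # /e evaluates the replacement as Perl code.
    (my $eval = $s) =~ s/,.*?(?:(cd).*?)??,/$1 ? "=$1=" : "=="/e;

    $got{$s} = [$cart, $eval];
    print "'$s' -> '$cart' and '$eval'\n";
}
```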