Beefy Boxes and Bandwidth Generously Provided by pair Networks
Just another Perl shrine
 
PerlMonks  

Replace zero-width grouping?

by tinypig (Beadle)
on May 06, 2003 at 20:03 UTC ( #256024=perlquestion: print w/replies, xml ) Need Help??

tinypig has asked for the wisdom of the Perl Monks concerning the following question:

I'd like to do a substitution on a string such that, of a pattern of one character, followed by three characters, followed by that same first character; that the first character becomes an 'A', and the second instance of the first character becomes a 'B'.

given:

my $a = "17341234";

I'd like for:

$a=~s/(.)(...)\1/A$2B/g; print "$a\n";

to yeild:

A7AAB2BB

instead of what it ends up yeilding which is:

A734B234

I understand that the second substituion I want to happen never happens because it begins where the last one finished, but I'd like to find a shorter method than my final solution:

$a=~s/(.)(?=...\1)/A/g; $a=~s/(?<=A...)(.)/B/g; print "$a\n";

which does in fact yeild:

A7AAB2BB

Any ideas for a shorter solution? The shorter, the better.

Replies are listed 'Best First'.
Re: Replace zero-width grouping?
by diotalevi (Canon) on May 06, 2003 at 20:17 UTC

    Move the loop out of the regex.

    $_="17341234"; 1 while s/(.)(...)\1/A$2B/; print "$_\n"

    Added: Also - if you need to prevent restarting the regex for some reason, you can add some additional code and still get that. The addition of the eval block and the substr() might be more expensive than starting over - I'll leave that to someone else to benchmark. (and on varying size data)

    $_="17341234"; $r = 0; 1 while substr($_,$r) =~ s/(.)(...)\1(?{local $r = pos() - 3})/A$2B/; print "$_\n"
      Thanks! My wording on the problem was kind of awkward, but if there is any confusion on it, I think what your first example describes the work I want to accomplish perfectly.

      I don't think that your second varient works?

      The problem being that your pos() which is looking at the pos of $_, but you are applying the s/// to an substr lvalue, which is a different animal all together?.

      I could be wrong on this, but I can't seem to get it to work.


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

        I tested the code before posting it. It works correctly. It turns out that it works incredibly inefficiently - the longer test string shows that this loops many more times than necessary. I accidentally had the substr() falling off the end and so it had to restart a lot. I've fixed it up and now its super-slick. The substr() always starts right after the position replaced by the 'A'. This will walk the input string and never covers the same ground twice.

        # The instrumented version my $a = "17341234173412341734123417341234"; $expected = "A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB"; $r = 0; 1 while substr($a,$r) =~ s/(?{print ">".pos()."$r\n"})(.)(...)\1(?{$r+ +=pos()-4;print"#$r\n"})/A$2B/; print "$a\n$expected\n"; print $a eq $expected ? "Ok\n" : "Failed\n"; # The clean version my $a = "17341234173412341734123417341234"; $r = 0; 1 while substr($a,$r) =~ s/(.)(...)\1(?{$r+=pos()-4;})/A$2B/; print "$a\n"
Re: Replace zero-width grouping?
by BrowserUk (Patriarch) on May 06, 2003 at 21:49 UTC

    Your two pass solution looks fine provided the string isn't any longer than the sample you supplied, but when the string gets longer, it falls down.

    my $a = "17341234173412341734123417341234"; $a=~s/(.)(?=...\1)/A/g; $a=~s/(?<=A...)(.)/B/g; print $a; A7AAB2BBB7BBB2BBB7BBB2BBB7BBB2BB

    diotalevi's solution works much better

    my $a = "17341234173412341734123417341234"; my $n=0; $n++ while $a =~ s/(.)(...)\1/A$2B/; print $n, $a; 12 A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

    The downside is that as the length of the string grows, so do the number of passes, and it is having to re-scan the parts of the string it has already processed, each time through. For short strings this isn't a great problem, but if the strings are longer, then there is an alternative method that avoids it.

    my $a = "17341234173412341734123417341234"; substr($a, $_, 5) =~ s[(.)(...)\1][A$2B] for 0 .. length ($a); print $a; A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB

    By using substr as an lvalue for the substitution, you can perform process in a single (overlapping) pass that reduces the work done by the regex engine by only looking at each group of 5 characters at a time.

    The difference in performance only really becomes evident once the string length gets above about 5 times the length of your original sample, but the technique is useful in its own right for some things.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      I hoped to fix the two-pass version by adding a check to make sure this character wasn't already earmarked to become a "B" before accepting it to become an "A":

      s/(?<!A...)(.)(?=...\1)/A/g; s/(?<=A...)(.)/B/g;
      but that still fails in exactly the same way, because the regexp engine goes to some effort to make sure that the pattern is matched against the original unmodified string each time through an s///g.

      It is also unfortunate that s/// doesn't have the same support for things the //gc flags - it would have been handy to be able to solve this with something like:

      pos($_) -= 3 while s/(.)(...)\1/A$2B/gc;

      Hmm, frustrating - I feel sure there must be a simple and efficient solution.

      Hugo
        I don't know exactly if this is the type of solution you and BrowserUK were looking for (working towards), but it seems to work.. and should cut back on the rescanning thing. I left the (?{print pos($_)."\n"}) (it can be taken out if it is to much of an eyesore), to show where it starts matching, and uncommenting the use re 'debug'; line seems to confirm this. Anyhow here is the code:
        use strict; use warnings; #use re 'debug'; $_ = "17341234173412341734123417341234"; my $i; pos() = $i while s<\G (?{print pos($_) . "\n"}) (.*?)(.)(...)\2 (?{$i=pos()-4}) > <defined $1?$1."A".$3."B":"A".$3."B">ex; print; __END__ 0 1 3 4 9 11 12 17 19 20 25 27 28 A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB
        update: FWIW I did do much benchmarking using cmpthese in the benchmark module, and it appears that after all the hand waving, and tweaks, nothing seems to run faster than BrowserUK's substr solution, and probably the one I would go with for long strings of digits. But I would make one last change tweak first. That being changing the first . to a \d .
        That is:
        substr($a, $_, 5) =~ s[(\d)(...)\1][A$2B] for 0 .. length ($a);
        Which should speed up the work the regex engine has to do as it will skip all the positions that do not start with a digit (and hence have already been turned into a letter (: ) For that matter the original could have been made better doing the same:
        s!(\d)(...)\1!A$2B!;

        -enlil

        Yes. I kept feeling there was a solution there somewhere, but be darned if I could find it.

        I tried playing games with \G as part of a positive look-behind assertion, but as is documented in the 5.8 copy of perlre, this isn't really supported. Constraining the match with a lvalue substr was the best I could come up with.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
      This is an iterative approach as well but more efficient:
      #!/usr/bin/perl -wl use strict; my $str = "17341234173412341734123417341234"; { local *_ = \$str; *_ = \substr $_, pos()-4 while s/(.)(...)\1/A$2B/; } print $str;

      I am exploiting the fact that assigning a ref to something to a glob creates an alias to that something; this also works for a ref to an lvalue such as substr returns. Only when a match happens, the substitution is restarted, and then in order to avoid rescanning, I hide the already processed front of the string by aliasing $_ to the unprocessed portion. In the ideal case of no match, this means the RE engine will only be invoked once, regardless of the string's length.

      Unfortunately, it doesn't work in this simple form because alas, s/// does not leave pos in a useful state after a substitution. The real code thus needs an ugly temporary variable and a messy addition to the substitution:

      my $i = 0; *_ = \substr $str, $i while s/(.)(...)\1(?{ $i += pos() - 4 })/A$2 +B/;

      The result is A7AAB2BBA7AAB2BBA7AAB2BBA7AAB2BB.

      Update: I just noticed this is essentially the same as diotalevi is doing. Doh.

      Makeshifts last the longest.

        This has taken a while because I encountered problems with benchmarking your solution. I probably would have given up at that point except that the problem itself is an interesting, in that it's indicative of a whole class of problems that require making global substitutions on a string, where the subsequent matches can overlap with previous replacements. The regex engine does not appear to have a good solution for this problem currently.

        The OP's original two pass solution doesn't scale once the matched string moves beyond the end of the original replacement.

        Taking the loop out of the regex as originally proposed by diotalevi works, but suffers from the need to start searching at the beginning of the string each time.

        The next step is to try and remember where the last match started, and arrange to start the next iteration at that position +1, but as hv pointed out, the convenience /cg modifier combo that can be applied to m//, doesn't work with s/// as pos get reset after substitutions.

        You, diotalevi and Enlil all came up with strategies for saving and restoring pos before / after substitution that work with varing degrees of success.

        I tried a crude method of moving the substitution through the string using substr and a for loop, which -- despite forcing the test against every possible n-length substring, rather than allowing the regex engine to find the matches for itself -- stands up surprisingly well to benchmarking.

        Its main advantage seems to be the simplicity. All the attempts to allow the regex engine to do the bulk of the searching (which it does so well in most cases) are offset by the need to make two or more function calls, do math and access additional variables.

        So, should anyone else but me be interested in these esoteric findings, here is the benchmark and some results. I've haven't found a way around the trap that happens when your Lvalue ref solution is called more than twice, so I've had to specifically eliminate it from most of the tests. I've included the results from a couple of 1-pass runs on very long strings (one packed and one sparse) by way of comparison.

        Every time I look at this, I still think that there is a simple solution that I am missing, but it has alluded me so far.

        Results

        Benchmark


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
Re: Replace zero-width grouping?
by Aristotle (Chancellor) on May 07, 2003 at 06:34 UTC
    Update: Enlil points out that this suffers from the same problem as the OP's own solution, highlighted by BrowserUk.
    #!/usr/bin/perl -wl use strict; $_="17341234"; s/(.)(?=...\1)|(?<=(.)...)\2/defined $1 ? "A" : "B"/eg; print; __END__ A7AAB2BB

    Makeshifts last the longest.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://256024]
Approved by Thelonius
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (4)
As of 2023-03-22 12:00 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    Which type of climate do you prefer to live in?






    Results (60 votes). Check out past polls.

    Notices?