http://www.perlmonks.org?node_id=512768

pKai has asked for the wisdom of the Perl Monks concerning the following question:

Dearest monks,

my journey on the road to Perl wisdom has led me to your gates. I'm here to present to you a little innocent problem, which nonetheless gave me some pain in thinking about it.

The task which torments me is the following:

"Given a string, e.g. "xx556xx", split it into an array, where the splits fall between every pair of adjacent non-identical characters, i.e. getting qw(xx 55 6 xx) from the example string above."

That seems easy enough. A first approach with a for-loop might look like this
sub seq1 {
    my $r;
    my $r0;
    my @R;
    for my $c (split '', shift) {
        unless (defined $r) {
            $r  = $c;
            $r0 = $c;
        }
        elsif ($c eq $r0) {
            $r .= $c;
        }
        else {
            push @R, $r;
            $r  = $c;
            $r0 = $c;
        }
    }
    push @R, $r if length $r;
    @R;
}
But what about Perl's promise of making "simple things easy"?

Let's bring out the bigger guns, then: regexes.

sub seq2 {
    my @x = shift =~ m/((.)\2*)/g;
    map $x[2*$_], 0..@x/2-1;
}
The 1st line is actually mine, while credit for the 2nd goes to murphy on the German perl-community board, fixing an oversight by me.

This looks quite good, and Benchmark even suggests that it has a slight performance edge over the for-driven sub.

Alas! it would be perfect, if no postprocessing was necessary on the match expression result!

Oh enlightened monks, is any such "easy expression for the simple problem" laid down in your holy books?

Footnote: granted that readability and elegance are to a great extent functions of individual perception, I was nonetheless frustrated that I could not dig up an "elegant" 1-liner for the described problem so far.
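
The Benchmark comparison mentioned in the question could be set up roughly as follows. This is only a sketch: the sub names runs_loop and runs_regex, the test input and the iteration count are assumptions for illustration, not the code that was actually timed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Loop-based variant, in the spirit of seq1 (hypothetical name).
sub runs_loop {
    my ($string) = @_;
    my @runs;
    for my $c (split //, $string) {
        if (@runs && substr($runs[-1], 0, 1) eq $c) {
            $runs[-1] .= $c;    # same character: extend the current run
        }
        else {
            push @runs, $c;     # different character: start a new run
        }
    }
    return @runs;
}

# Regex variant, as in seq2: keep every other element of the match list.
sub runs_regex {
    my ($string) = @_;
    my @x = $string =~ m/((.)\2*)/g;
    return map { $x[2 * $_] } 0 .. @x / 2 - 1;
}

my $input = 'xx556xx' x 10;

# Run each candidate 10_000 times and print a comparison table.
cmpthese(10_000, {
    loop  => sub { my @r = runs_loop($input) },
    regex => sub { my @r = runs_regex($input) },
});
```

The resulting table depends on the machine and the perl version, so it is worth re-running with realistic input before drawing conclusions.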

Re: Elegant way to split into sequences of identical chars? (while //g)
by tye (Sage) on Nov 29, 2005 at 21:01 UTC
    sub splitSameChars {
        for( shift @_ ) {
            push @_, $1 while /((.)\2*)/g;
        }
        @_;
    }

    - tye        

      For me, easier to understand written so:
      use strict;
      use warnings;
      use Data::Dumper;

      print join " ", splitSameChars('xx556xx');

      sub splitSameChars {
          my $letters = shift;
          push @_, $1 while $letters =~ /((.)\2*)/g;
          @_;
      }
      (The for loop confused me.)
        (The for loop confused me.)

        Ah, but it wasn't a for loop. It was a for aliasing. This can be a very useful technique, so you might want to get used to it. Granted, in this case, it was done to try to better meet the original request for 'an "elegant" 1-liner', and isn't the way I would usually write code (and not even the way I originally posted the code, before I noticed the quoted part of the original request).

        - tye        
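
A minimal sketch of the "for aliasing" technique described above: a one-element for list makes $_ an alias for that element, so bare matches and substitutions in the body act on the original variable, and the body runs exactly once. The sub names split_same_chars and trimmed are hypothetical.

```perl
use strict;
use warnings;

# Use for() purely to alias $_ to $string; this is not really a loop.
sub split_same_chars {
    my ($string) = @_;
    my @runs;
    for ($string) {
        push @runs, $1 while /((.)\2*)/g;   # the bare match applies to $_
    }
    return @runs;
}

# Because $_ is an alias, substitutions change the aliased variable itself.
sub trimmed {
    my ($text) = @_;
    for ($text) {
        s/^\s+//;
        s/\s+$//;
    }
    return $text;
}

print join(' ', split_same_chars('xx556xx')), "\n";   # xx 55 6 xx
print '[', trimmed('  hello  '), "]\n";               # [hello]
```

Compared with "local $_ = shift", the for form needs no explicit localisation: the alias ends when the block does.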

Re: Elegant way to split into sequences of identical chars?
by japhy (Canon) on Nov 29, 2005 at 21:23 UTC
    This is a little crufty, I'll admit, but I like it!
    sub splitter {
        local $_ = shift;
        split /(?<!^)(?!(??{ quotemeta substr($_, $-[0]-1, 1) }))/;
    }
    This splits a string at all locations that 1) are not the beginning of the string, and 2) are not followed by the character immediately preceding this location.

    Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
    How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart
      A little less obfuscated if you use extended regex comments. (I liked this solution a lot, because I feel like it gets to the "core" of what is wanted.)

      use strict;
      use warnings;
      use Data::Dumper;

      print join " ", splitter('xx((556xx');

      sub splitter {
          local $_ = shift;
          split /
              (?<!^)    # not preceded by start of string.
              (?!       # not followed by...
                  (??{
                      quotemeta substr($_, $-[0]-1, 1)
                      # the escaped (quote-metad) last character of the last match.
                      # note: $-[0] is the offset of start of last successful match.
                      # $-[1] (not used here) would be the offset of start of the first sub pattern.
                  })
              )
          /x;
      }
Re: Elegant way to split into sequences of identical chars?
by ysth (Canon) on Nov 30, 2005 at 04:34 UTC
    I don't have a perl to test with right now, but I think this will work:
    $repeater = qr/(.)\1+/;
    @matches = $string =~ /((??{$repeater}))/g;

      With a minor change to qr/(.)\1*/, so that non-repeated characters are picked up too, it works fine.

      Very elegant++.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        ingenious!

        And not only elegant++, but also by far the fastest solution so far, provided Benchmark is not lying to me.

        Since it is also very compact already, we get the most compact variant so far with

        /((??{'(.)\1*'}))/g

        This is not as fast as the precompiled regex, of course, but still faster than the other snippets seen.

      That's really cool, but I'm confused as to why it works. From perlre about (??{ code }):

      This is a "postponed" regular subexpression. The "code" is evaluated at run time, at the moment this subexpression may match. The result of evaluation is considered as a regular expression and matched as if it were inserted instead of this construct.

      Maybe I'm thrown by the "matched as if it were inserted" part when what's being inserted is a regular expression -- if it were just inserted, I don't understand why your approach works different from just putting the subexpression in directly. It looks like it's being evaluated somehow "separately" from the rest of the regular expression, which I didn't expect given the doc. I don't know if that makes sense, but this code sample illustrates where I'm thrown:

      my $re = qr/(.)\1*/;
      @matches = $string =~ m/($re)/g;
      print "@matches\n"; # x x x x 5 5 5 5 6 6 x x x x
      @matches = $string =~ m/((??{$re}))/g;
      print "@matches\n"; # xx 55 6 xx

      Could you please help me understand what's really going on that's different between these two cases?

      -xdg

      Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

        Please try this to see more of what is going on here:
        use warnings;
        use strict;
        use re 'debug';

        my $re = qr/(.)\1*/;
        my @matches;
        my $string = "xx556xx";

        @matches = $string =~ m/($re)/g;
        print "@matches\n\n"; # x x x x 5 5 5 5 6 6 x x x x

        @matches = $string =~ m/((??{$re}))/g;
        print "@matches\n\n"; # xx 55 6 xx
        To me it looks like a question of the greedy star versus returning the first possible match, and of whether \1 is evaluated before the match is returned, but I don't feel confident enough to guess any further.
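
One way to read the difference, going by how perlre describes the two constructs: an interpolated qr// becomes part of the one surrounding pattern, so in m/($re)/ the outer parentheses are group 1, the inner (.) becomes group 2, and \1 refers to the still-open outer group. The \1* part then matches nothing, every match is a single character, and both groups are returned. With (??{ $re }) the inner pattern is matched as an independent pattern with its own group numbering, and its captures are not visible outside. The sketch below illustrates this with a relative backreference, \g{-1}, which assumes Perl 5.10 or later; the sub names are hypothetical.

```perl
use strict;
use warnings;

# \g{-1} means "the capture group opened most recently before this point",
# so it still refers to (.) after interpolation shifts the group numbers.
my $run = qr/(.)\g{-1}*/;

# Interpolated: the outer parens are group 1 and the inner (.) is group 2,
# so every match contributes two elements to the list.
sub interpolated_runs {
    my ($string) = @_;
    return $string =~ m/($run)/g;
}

# Postponed: $run is matched as its own pattern; only the outer group
# is returned to the caller.
sub postponed_runs {
    my ($string) = @_;
    return $string =~ m/((??{ $run }))/g;
}

print join(' ', interpolated_runs('xx556xx')), "\n";   # xx x 55 5 6 6 xx x
print join(' ', postponed_runs('xx556xx')), "\n";      # xx 55 6 xx
```

With the relative backreference the interpolated form finds the right runs, but it still returns the inner group alongside each run, which is the postprocessing the original question wanted to avoid.
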
Re: Elegant way to split into sequences of identical chars?
by Roy Johnson (Monsignor) on Nov 29, 2005 at 21:53 UTC
    I know, obfuscated != elegant. Nevertheless, I thought it was kind of neat.
    my $str = 'xx556xx';
    my @x = grep --$|, $str =~ m/((.)\2*)/g;
    print join ',', @x;
    The trick with $| is that it's a toggle, so grep throws away every other member of its input. The same thing could be done with a normal variable like so:
    my $str = 'xx556xx';
    my $tog = 0;
    my @x = grep $tog = !$tog, $str =~ m/((.)\2*)/g;
    print join ',', @x;
    or using split:
    my $str = 'xx556xx';
    my $tog = 0;
    my @x = grep $tog = !$tog, split /(?<=(.))(?!\1)/, $str;
    print join ',', @x;

    Caution: Contents may have been coded under pressure.

      I was fascinated by the use of the $| "output autoflush". I usually use '$| = 1;' or '++$|;' for the scripts I write on Windows. So I tried the same.

      my $str = 'xx556xx';
      my @x = grep --$|, $str =~ m/((.)\2*)/g;
      my @y = grep ++$|, $str =~ m/((.)\2*)/g;
      print join ',', @x;
      print "\n";
      print join ',', @y;

      The output being:

      xx,55,6,xx

      xx,x,55,5,6,6,xx,x

      I was somewhat surprised, as all the documentation I've seen for '$|' says a non-zero value will force fflush after every 'write' or 'print' statement. I assumed the behavior would be the same for positive and negative values. What gives?

      Mike

        I've seen for '$|' says a non-zero value will force fflush after every 'write' or 'print' statement.
        For a better explanation, see the thread Perl Idioms Explained - $|++, especially where it talks about ++ and --.

        -QM
        --
        Quantum Mechanics: The dreams stuff is made of
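
A small sketch of why the two greps above behave differently, on the understanding that $| is a magical variable that only ever reads back as 0 or 1: any non-zero value stored in it, including -1, is seen as 1 on the next read. Decrementing therefore flips it (0 becomes -1, read back as 1; 1 becomes 0), while incrementing leaves it stuck at 1. The sub name toggle_demo is hypothetical.

```perl
use strict;
use warnings;

sub toggle_demo {
    local $| = 0;    # start from a known state; restored when the sub returns
    my (@dec, @inc);

    for (1 .. 4) {
        my $seen = --$|;    # copy the value right away; $| itself is magical
        push @dec, $seen;
    }

    $| = 0;
    for (1 .. 4) {
        my $seen = ++$|;
        push @inc, $seen;
    }

    return ("@dec", "@inc");
}

my ($dec, $inc) = toggle_demo();
print "--\$|: $dec\n";   # --$|: 1 0 1 0
print "++\$|: $inc\n";   # ++$|: 1 1 1 1
```

So grep --$| keeps every other element, whereas grep ++$| is always true and keeps everything.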

Re: Elegant way to split into sequences of identical chars?
by thundergnat (Deacon) on Nov 29, 2005 at 21:39 UTC
      ... perhaps the best of which is Anonymous Monk's "In Perl6, this can be done using a single, 5 Unicode-character long, operator."

      Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
      How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart

        Notice I didn't say lots of GOOD answers there...

        Instead, there's the whole gamut of answers. A couple examples of good idiomatic perl, a few obscure hackish answers, one of C written in perl and at least one of someone who seriously shouldn't have skipped his meds that morning.

Re: Elegant way to split into sequences of identical chars?
by pKai (Priest) on Nov 29, 2005 at 22:54 UTC
    Thank you very, very much to all repliers.

    Sorry, I was not aware of the meditation from June. Though I think my inquiry gave room for deepening the insight and even showed new ideas (japhy's demonstration of using (??{})).

    I think I still need some time to work out if, where and why I have now seen the ultimate elegant solution to the problem. At least it seems clear that I don't have to wait for the "5 Unicode-character long operator from Perl6" ;-)

    Thank you and good night for now.

Re: Elegant way to split into sequences of identical chars?
by xdg (Monsignor) on Nov 30, 2005 at 03:49 UTC
    Alas! it would be perfect, if no postprocessing was necessary on the match expression result!

    Hardly elegant, but along the lines of the many other worthy entries. (Awful hack with $a to avoid declaring it.) Does not have to throw anything away. And probably benchmarks worse given the perl evaluation each time through!

    my $line = "xx556xx";
    my @list = $line =~ m/((??{
        $a = substr( $line, pos($line), 1 );
        # test length rather than truth, so a "0" character is not mistaken for end of string
        length($a) ? quotemeta($a) . '+' : '$'
    }))/g;
    print "@list";

    -xdg

    Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re: Elegant way to split into sequences of identical chars?
by ivancho (Hermit) on Nov 30, 2005 at 10:19 UTC
    clearly not as clean as the regexp solutions, but just for laughs
    perl -le '/$c/||$i++, $d[$i].=$c=$_ for split "","aaaabbbccdaaa"; print "@d"'
    aaaa bbb cc d aaa
Re: Elegant way to split into sequences of identical chars?
by tphyahoo (Vicar) on Nov 30, 2005 at 14:54 UTC
    This is almost a repeat of tye's answer, but without peeking at the others I came up with
    use strict;
    use warnings;
    use Data::Dumper;

    my @repeats;
    while ( 'xx556xx' =~ m/((.)(\2)*)/g ) {
        push @repeats, $1;
    }
    print Dumper(\@repeats);

      Forgot to post my solution earlier, but it's similar to yours so I'll add it here.

      use Data::Dumper;

      my $test = "xx556xxyyy";
      my @matches;
      push @matches, $1 while ($test =~ /((.)\2*)/g);
      print Dumper(@matches);

      These solutions have the added benefit of being able to switch to just a list of chars by changing $1 to $2. Not sure that's of any use, but hey. lol. I'm mildly surprised there is no regex command to match the last char, or a way to specify which groups to return and which are just used internally by the regex. I guess the need doesn't arise that often.


      ___________
      Eric Hodges $_='y==QAe=e?y==QG@>@?iy==QVq?f?=a@iG?=QQ=Q?9'; s/(.)/ord($1)-50/eigs;tr/6123457/- \/|\\\_\n/;print;