Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Puzzlement in Splitsville

by tlm (Prior)
on Jun 19, 2005 at 12:43 UTC ( #468106=perlmeditation: print w/ replies, xml ) Need Help??

This is a question disguised as a meditation disguised as a puzzle. (I think Meditations is the best section for it, but, most esteemed janitors, please feel free to move it to SoPW, or wherever else is more appropriate.)

NB: If you are a split-meister, you may want to cut to the chase, and go straight to The Question below.

The Puzzle

OK, as Perl puzzles go this is not a very hard one, but I think it will still be interesting to those who haven't already seen something like it. Find a simple way to split a string into substrings of length 3, say (the last chunk may be shorter, if the length of the string is not divisible by 3). (For simplicity, assume the string contains no newlines...or trailing 0s [thanks, Smylers].) For example, if the input string is

atgactaatagcagtgg
the output should be the list
0 'atg' 1 'act' 2 'aat' 3 'agc' 4 'agt' 5 'gg'

What trips one in such a puzzle (or at least tripped me) is the word "split" in the posing of it, which leads one immediately to think of Perl's split builtin function. It is possible to use split for this, but the only solution I know of requires a filtering through grep:

@codons = grep $_, split /(.{3})/, 'atgactaatagcagtgg'; print "@codons\n"; __END__ atg act aat agc agt gg
Note that the parens are required in the regex. (If it's not clear why, see split, in particular the role of capture in the regex argument.)

But a simpler solution requires only m//g, without any filtering:

@codons = 'atgactaatagcagtgg' =~ /.{1,3}/g; print "@codons\n"; __END__ atg act aat agc agt gg
Note that parens are not needed in this case, but it is necessary to use the range quantifier {1,3} instead of the "exact" quantifier {3}.

The Question

OK, that was all preamble to my real question, which is, is there a simple regex-based solution to split a string into "runs" of the same character? For example, if the input is 'aaabbcddddaee', then the output should be the list

0 'aaa' 1 'bb' 2 'c' 3 'dddd' 4 'a' 5 'ee'
The best I can come up with is the gangly:
@runs = do { my $i; grep ++$i%2, 'aaabbcddddaee' =~ /((.)\2*)/g }; print "@runs\n"; __END__ aaa bb c dddd a ee
I'd be interested in learning of more elegant solutions.

Update: In response to BrowserUk's question, yes order matters.

Update2: Fixed puzzle's statement, in response to Smylers' observation.

The Other Question

Incidentally, what makes my last solution so awkward is the extraneous machinery to get rid of every other item in the list returned by m//g. Is there a better idiom for selecting (or filtering out) every n-th item from a list (not an array!) of unknown length? (Of course, if an idiom requires hauling in a module, it is automatically somewhat lame, particularly if it's a non-core module.)

the lowliest monk

Comment on Puzzlement in Splitsville
Select or Download Code
Re: Puzzlement in Splitsville
by BrowserUk (Pope) on Jun 19, 2005 at 13:24 UTC

    Does order matter?

    print keys %{{ 'aaabbcddddaee' =~ m[((.)\2*)]g }}

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Puzzlement in Splitsville
by thundergnat (Deacon) on Jun 19, 2005 at 13:33 UTC

    I don't know if it is any more elegant....

    push @runs, $1 while 'aaabbcddddaee' =~ /((.)\2*)/g; print "@runs\n";
      Update: I misread the question.

      thundergnat's pattern doesn't work for me with my 5.6.1, it puts every character into an element.

      I had: push @ans, $1 while $str =~ /(.{3}|.{1,2}$)/g;

      Be well,
      rir

Re: Puzzlement in Splitsville
by eyepopslikeamosquito (Canon) on Jun 19, 2005 at 14:30 UTC

    thundergnat's solution is the best I can see (I came up with it independently) because it avoids the ugly removal of every 2nd item. However, because variety is the spice of life I present the ganglier:

    my @runs = do { my $i; grep ++$i%2, split /(?<=(.))(?!\1)/, 'aaabbcddd +daee' };
    which suffers from the same annoyance as your original solution: the parens, which are needed for the back reference, have the annoying side effect here of changing the semantics of what split returns.

Re: Puzzlement in Splitsville
by ysth (Canon) on Jun 19, 2005 at 18:41 UTC
    If you have a brief alphabet (intimated by your earlier example using just actg), this is easy:
    print join " ", "aaabbcddddaee" =~ /(a+|b+|c+|d+|e+)/g
Re: Puzzlement in Splitsville
by Anonymous Monk on Jun 20, 2005 at 09:20 UTC
    In Perl6, this can be done using a single, 5 Unicode-character long, operator.
Re: Puzzlement in Splitsville
by BrowserUk (Pope) on Jun 20, 2005 at 09:47 UTC
    In response to BrowserUk's question, yes order matters.

    I guess you could do the FP thing:

    sub oddEls; sub oddEls{ @_ ? (( shift, shift )[ 0 ], oddEls @_) : () } print oddEls 'aaabbcddddaee' =~ /((.)\2*)/g;; aaa bb c dddd a ee

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
      Your solution works only because oddEls always gets an even length list. Replace @_ ? with @_ > 1 ?, and oddEls also works with odd length lists.

        Are you sure?

        print oddEls 1 .. 9;; 1 3 5 7 9 print oddEls 1 .. 10;; 1 3 5 7 9

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
        "Science is about questioning the status quo. Questioning authority".
        The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.
Re: Puzzlement in Splitsville (/g modifier)
by demerphq (Chancellor) on Jun 20, 2005 at 11:53 UTC

    Id say you have got yourself into a situation of functional fixidity. Instead of using the list form of regexes, use the /g modifier with while. Its intended for exactly this type of scenario. Also, for your first question about splitting into triplets or what have you its probably more efficient to use unpack if you can. Im not sure but its possible 5.6 doesnt support the pattern grouping in pack formats.

    D:\>perl -le "print $1 while 'aaabbcddddaee'=~/((.)\2*)/gs" aaa bb c dddd a ee
    D:\>perl -le "print for unpack '(a3)*','atgactaatagcagtgg'"; atg act aat agc agt gg

    Note the difference in approaches between using for and while. For expects the list to be preconstructed, while expects the list to be constructed as we go.

    UPDATE: Apologies to thundergnat, I didn't notice his comment before i posted.

    UPDATE2: To answer your question about eliminating every $N th element id say you have the right approach. I would do it via:

    my @filtered=do { my $i=1; grep $i++ % $N,@unfiltered};

    ---
    $world=~s/war/peace/g

Re: Puzzlement in Splitsville
by Smylers (Pilgrim) on Jun 20, 2005 at 14:03 UTC

    It is possible to use split for this, but the only solution I know of requires a filtering through grep:

    @codons = grep $_, split /(.{3})/, 'atgactaatagcagtgg';

    As demerphq pointed out, unpack is a better way to achieve this. But if you're going to use split, note that that grep condition is wrong. In particular if the last character is a zero and it should be in an an element of its own then it will get omitted:

    my @codons = grep $_, split /(.{3})/, 'atgactaatagcagt0';

    Explicitly testing for the empty string is required:

    my @codons = grep { $_ ne '' } split /(.{3})/, 'atgactaatagcagt0';

    Smylers

      In particular if the last character is a zero and it should be in an an element of its own then it will get omitted:

      if you say grep length($_), ... you avoid this problem...

      ---
      $world=~s/war/peace/g

Re: Puzzlement in Splitsville
by Anonymous Monk on Jun 20, 2005 at 21:32 UTC
    For question 1, whatever happened to coding things the old-fashioned way?

    # to split a string $string into substrings of size $len ...

    for($i=0;$i < length($string)/$len; $i++) { $codons[$i] = substr($string,$i*$len,$len); }
    For question 2, why not something like this? No ugly regexp tricks, no ugly grep syntax, just the straighforward brute force encoding of the algorithm. Why is there an obsession with obscure one liners here, anyway?

    $start=0; for ($pos=0; $pos < length($string)-1;$pos++) { if ( substr($string,$pos,1) ne substr($string,$pos+1,1) ) { $runs[$i++] = substr($string, $start,$pos-$start+1); $start=pos; } # end check for end of run } # end loop across letters of string $runs[$i++] = substr($string,$start,$pos-$start+1);

      For question 1, whatever happened to coding things the old-fashioned way?

      Maybe because if we wanted to write verbose error prone code we would stick to C where it would at least be fast? I mean seriously do you really think your solution is easier to read or understand than

      @codons = unpack "(a3)*",$string;

      and

      my @runs; push @runs,$1 while $string=~/((.)\2*)/gs;

      I mean your code has about 5 times the informational density than mine. That says to me that there is 5 times more places to make silly errors and its 5 times more likely that some maintainer will misunderstand and break the code. And as a bonus the perlish code is going to be MUCH faster. I'll leave benchmarking it to you, but I think youll be unpleasantly surprised at how much slower your variants are.

      Even if you did for some weird reason want to avoid the natural perl way to do these things the code you have is less than desirable. Three arg for loop when you could have used for LIST and avoided potential fencepost errors etc.

      And to prove my point a close inspection of your second snippet reveals a subtle error: $start=pos; is wrong. It should $pos as pos is a special keyword used by the regex engine. A mistake that wouldn't have occured had you not been trying to write C code in Perl.

      ---
      $world=~s/war/peace/g

        YOUR CODE IS WRONG, PERIOD IT'S TOO COMPLEX, AND NOT GOING TO BE UNDERSTOOD BY *ANYONE*. YOU'RE JUST TRYING TO EXCUSE BAD CODING. I'M ***SICK*** OF THESE POST BY PEOPLE WHO DON'T UNDERSTAND ENGLISH. YOU DON'T USE LOOPS AS MODIFIERS: NO ONE SPEAKS THAT WAY. IF THEY DO, IT'S INCORRECT. YOU DON'T USE WIERD IMPLICT LOOPS WHEN YOU CAN MAKE THEM EXPLICIT AND CLEAR. PERIOD. IF YOU WANT TO WRITE OBFUSCATED CODE, WELL, GO TO HELL. THERE'S ***NO*** EXCUSE FOR WRITING CODE IN ANY OTHER WAY THAN A CORRECT, LINEAR CODE FLOW WITH PROVABLE OUTCOME. ALL THIS LISP, SUB-EXPRESSION, FUNCTION CHAIN NOTATION WAS PROVEN WRONG YEARS AGO... AND CONTINUES TO BE WRONG TODAY. MAKE EACH STEP SIMPLE, AND CORRECT. YOUR SOLUTIONS ARE INCOMPREHENSIBLE, AND THEREFORE, WRONG!! LEARN TO CODE RIGHT!!! I'M ****SICK**** OF FIXING BROKEN CODE BY SMART-ASS BASTARDS WHO FUCK EVERYTHING UP, AND LEAVE ME TO MAINTAIN THEIR UNDOCUMENTED MESSES. IMPLICIT LOOPS, IMPLICIT VARIABLES, IMPLICIT *ANYTHING* IS STATE INFORMATION THAT ISN'T OBVIOUS, AND IS HENCE WRONG. YOUR CODE IS RIDDLED WITH SO MUCH SHIT I DON'T KNOW WHERE TO BEGIN. THE NATURAL PERL WAY IS CORRECT CODING; REGARDLESS OF LANGUAGE. LEARN TO CODE, OR SHUT THE HELL UP. I'M SICK OF SMUG IDIOTS WHO DON'T GIVE A DAMN ABOUT DOING THINGS THE RIGHT WAY -- SO SHAPE UP, OR GET THE HELL OUT! I'M SICK OF YOU, AND I'M SICK OF CLEANING UP THOUSAND LINE FUNCTIONS THAT SET IMPLICIT FLAGS ALL OVER THE PLACE, JUST BECAUSE SOME SMUG IDIOT LIKES TO MAKE THINGS HARD. FUCK YOU, AND GET THE HELL OUT OF THE INDUSTRY BEFORE YOU RUIN THINGS FOR THE REST OF US.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://468106]
Approved by BrowserUk
Front-paged by grinder
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2014-09-21 18:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (174 votes), past polls