http://www.perlmonks.org?node_id=939987

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
I have string of length in multiple of 3.
my $seq = "CTTCGAATT"; # in this case length of 9
Is there a way I can split it into equal length of 3? Such that in the end I have this array:
$VAR = ["CTT", "CGA", "ATT"];
In practice I have around 10 million of such string to break. I can always use iteration with 'substr' command. But it's very slow. Is there a faster way to do it?

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Fast Way to Split String in to Chunk of Equal Length
by ikegami (Pope) on Nov 25, 2011 at 07:22 UTC
    my @parts = $seq =~ /.{3}/sg;
    my @parts = unpack('(a3)*', $seq);

    You didn't specify how you want to handle strings whose length isn't a multiple of three, so I presume it can't happen.

Re: Fast Way to Split String in to Chunk of Equal Length
by davido (Cardinal) on Nov 25, 2011 at 08:11 UTC

    Are you sure that chunking the strings is where your performance bottleneck is? Have you profiled? Could IO be a more significant constraint to execution time? I only ask because it doesn't seem like the performance of substr for 10,000,000 strings is all that bad for the problem domain.

    If this section of code is really significant, here's a comparison of the valid solutions provided up to this point in the thread. Naturally unpack wins. It's only a few seconds slower in 10,000,000 iterations than the "control" case (which isn't a solution, but just a check to see what the framework for each solution costs).

    Here's the output.

    Benchmark: timing 10000000 iterations of control, match, substr, unpac +k... control: 5 wallclock secs ( 5.65 usr + 0.00 sys = 5.65 CPU) @ 17 +69911.50/s (n=10000000) match: 22 wallclock secs (21.12 usr + 0.00 sys = 21.12 CPU) @ 47 +3484.85/s (n=10000000) substr: 12 wallclock secs (11.88 usr + 0.00 sys = 11.88 CPU) @ 84 +1750.84/s (n=10000000) unpack: 8 wallclock secs ( 9.04 usr + 0.00 sys = 9.04 CPU) @ 11 +06194.69/s (n=10000000)

    By the way: This question was crossposted to StackOverflow here.


    Dave

Re: Fast Way to Split String in to Chunk of Equal Length
by CountZero (Bishop) on Nov 25, 2011 at 07:16 UTC
    I am not sure it is fast but it works.
    use Modern::Perl; use Data::Dump qw/dump/; my $seq = "CTTCGAATT"; # in this case length of 9 my @strings = $seq =~ /(.{3})/g ; say dump(@strings);
    Output:
    ("CTT", "CGA", "ATT")

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      I think your code is just a tad too example specific. I doubt that in his real data the string will always be 9 characters. But it will always be a multiple of 3.

      If it is always 9 characters, then your code is as good as an alternatives. update Wait a minute, did you just change your code? If not, I must have replied to the wrong node.

        Nope, did not change my code at all.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Fast Way to Split String in to Chunk of Equal Length
by BrowserUk (Pope) on Nov 25, 2011 at 10:31 UTC

    unpack is the way to go, but one tip is that unpack 'a3' x $n, $str runs a few percent quicker than unpack '(a3)*', $str if you know in advance how many fields there are.

    Also, if your problem allows you to unpack the strings just in time at the point where you need them, rather than en masse, avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      unpack is the way to go, but one tip is that unpack 'a3' x $n, $str runs a few percent quicker than unpack '(a3)*', $str if you know in advance how many fields there are.

      How can you not know in advance how many fields there are (int(length($seq)/3))?

      avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time

      Any implementation in mind? Something like the following?

      while ($seq =~ /(.{3})/sg) { # Substring in $1 ... }

        Assuming your question is in regard to:

        Also, if your problem allows you to unpack the strings just in time at the point where you need them, rather than en masse, avoiding the allocation, creation and destruction of 10 million concurrent small arrays -- even if you have ample memory for them -- can save a significant amounts of time.

        I meant -- unpack the strings just in time at the point where you need them -- something like:

        for my $seq ( @seqs ) { my @bits = unpack 'a3a3a3', $seq; ## $bits[ ... ]; }

        rather than -- en masse --:

        @seqs = map[ unpack '(a3)*', $_ ], @seqs; for my $ref ( @seqs ) { ## $ref->[ ... ] }

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        At it again. Unannounced updates. The above post originally consisted entirely of just:

        Any implementation in mind? Something like the following?
        while ($seq =~ /(.{3})/sg) { # Substring in $1 ... }

        To answer the dumb question that wasn't there originally,

        How can you not know in advance how many fields there are (int(length($seq)/3))?

        If you have to calculate the number, you throw away the few percent gained.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fast Way to Split String in to Chunk of Equal Length
by TJPride (Pilgrim) on Nov 25, 2011 at 11:58 UTC
    I would think that substr would be fast enough to do what you need. The main bottleneck, as BrowserUK said, is probably not substr, but rather disk read (if you're reading one record at a time) or memory allocation. On the former, you could try using read() and a buffer variable. Be simple enough to read in some large multiple of 10 (9 + separator character) and then generate directly from that:

    use strict; use warnings; my ($handle, $buffer, $size, @parsed, $i); open($handle, 'my-data.txt'); while ($size = read($handle, $buffer, 30)) { for ($i = 0; $i < $size; $i += 10) { push @parsed, [ substr($buffer, $i, 3), substr($buffer, $i+3, 3), substr($buffer, $i+6, 3) ]; } }
      Note that I was testing with a read of 30 - should have modified it back to 10240 before posting.
Re: Fast Way to Split String in to Chunk of Equal Length
by pvaldes (Chaplain) on Nov 25, 2011 at 12:35 UTC

    Similar, but better like this, less error prone (and a little faster probably)

    my $seq = qw(CCCTCTCTCCCAGGGGCA); my @parts = $seq =~ /[CGTAU]{3}/sg; foreach my $codon(@parts){print $codon," "}
Re: Fast Way to Split String in to Chunk of Equal Length
by pvaldes (Chaplain) on Nov 25, 2011 at 12:37 UTC

    note also that in this code the last non-triplet is silently discarded.

    my $seq = qw(CCCTCTCTCCCAGGGGCAAA); prints: CCC TCT CTC CCA GGG GCA AA discarded
    This could be, or not, what you want, so maybe you'll need to fix this

A reply falls below the community's threshold of quality. You may see it by logging in.