PerlMonks  

substrings that consist of repeating characters

by Anonymous Monk
on Sep 27, 2020 at 17:30 UTC ( #11122267=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi. I am studying regular expressions and wanted to write a script that searches a DNA string for the longest substrings that consist of repeating letters, for example CCCCC or GGG or AAAA. I managed to do that, but I am not very happy with the end result. I was hoping to get most of the work done with a regex, and in that regard I have failed. Furthermore, there are statements in the while loop that look doubtful, and the idea of using an array to store the substring along with its length might not be good. Any advice is welcome. Thank you.

use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my @substrings;

while($string =~ /([ACTG])(\1+)/g){
    my $comb = $1.$2;
    my $len = length($1) + length($2);
    push @substrings, [$comb,$len];
}

my @sorted = sort {$b->[1] <=> $a->[1]} @substrings;

foreach my $substring (@sorted){
    foreach my $element (@$substring){
        print "$element ";
    }
    print "\n";
}

Replies are listed 'Best First'.
Re: substrings that consist of repeating characters
by tybalt89 (Prior) on Sep 27, 2020 at 20:23 UTC

    TIMTOWTDI

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=11122267
    use warnings;

    my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
    my @substrings;
    push @{ $substrings[length $1] }, $1 while $string =~ /(([ACTG])\2+)/g;
    my @sorted = map @{ $_ // [] }, reverse @substrings;

    use Data::Dump 'dd'; dd \@sorted;

    Outputs:

    [ "CCCCCC", "GGGG", "AAA", "TTT", "TTT", "TTT", "TT", "TT", "AA", "GG", "GG", "TT", "AA", "TT", "TT", ]
Re: substrings that consist of repeating characters
by GrandFather (Saint) on Sep 27, 2020 at 22:26 UTC

    In Perl, length is cheap, so calculate it when you need it. The following code is a little more Perlish but, other than using a threshold to drop short strings, is similar to your code. For variety's sake the regex has changed slightly to be a little easier to grok:

    use strict;
    use warnings;

    my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
    my @runs;
    my $threshold = 3;

    length $1 >= $threshold && (push @runs, $1) while $string =~ /(A+|C+|G+|T+)/g;
    @runs = sort {length($b) <=> length($a)} @runs;
    printf "@runs\n";

    Prints:

    CCCCCC GGGG AAA TTT TTT TTT
    Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      > For varieties sake the regex has changed slightly to be a little easier to grok:

      Oh ... we got a one-liner :)

      DB<56> $_ = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

      DB<57> $threshold = 3;

      DB<58> x sort { length($b) <=> length($a) } grep { length >= $threshold } /(A+|C+|G+|T+)/g
      0  'CCCCCC'
      1  'GGGG'
      2  'AAA'
      3  'TTT'
      4  'TTT'
      5  'TTT'
      DB<59>

      EDIT

      and for the original problem

      DB<59> x sort { length($b) <=> length($a) } /(AA+|CC+|GG+|TT+)/g
      0  'CCCCCC'
      1  'GGGG'
      2  'AAA'
      3  'TTT'
      4  'TTT'
      5  'TTT'
      6  'TT'
      7  'TT'
      8  'AA'
      9  'GG'
      10  'GG'
      11  'TT'
      12  'AA'
      13  'TT'
      14  'TT'
      DB<60>

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re: substrings that consist of repeating characters
by kcott (Bishop) on Sep 28, 2020 at 02:42 UTC

    TMTOWTDI

    Given that biological data can be huge, using Perl's builtin string-handling functions can often be far more efficient than using regexes. Using Benchmark can help when choosing a solution.

    The following code still uses regexes but only minimally:

    #!/usr/bin/env perl

    use 5.014;
    use warnings;

    my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
    my $min_repeat = 2;

    for my $base (qw{A C G T}) {
        say "$base: ", get_longest_length($string, $base, $min_repeat);
    }

    sub get_longest_length {
        my ($str, $base, $min) = @_;

        my $re = '[' . 'ACGT' =~ s/$base//r . ']+';

        return (
            sort { length $b <=> length $a }
            grep length $_ >= $min,
            split /$re/, $str
        )[0];
    }

    Output:

    A: AAA
    C: CCCCCC
    G: GGGG
    T: TTT

    Notes:

    • I've specified v5.14 to use the 'r' modifier. See "perl5140delta: Non-destructive substitution".
    • You can use index to find the number and position(s) of maximum-length substring(s).
    • There are a number of optimisations that could be applied, but that will largely depend on your intended usage of this code.
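
    The note about index can be sketched roughly as follows. This is a minimal sketch rather than part of the code above: the helper name find_all_offsets is hypothetical, and it assumes the longest run (TTT here) has already been found, for example with get_longest_length.

```perl
use strict;
use warnings;

# Hypothetical helper: return every offset at which $needle occurs in
# $haystack, using only the builtin index() (no regex).
sub find_all_offsets {
    my ($haystack, $needle) = @_;
    my @offsets;
    my $pos = 0;
    while ((my $found = index($haystack, $needle, $pos)) >= 0) {
        push @offsets, $found;
        $pos = $found + length($needle);    # continue after this occurrence
    }
    return @offsets;
}

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
my @offsets = find_all_offsets($string, 'TTT');

print "TTT occurs ", scalar(@offsets), " time(s) at offsets: @offsets\n";
```

    Because the needle is a maximum-length run, no longer run of the same base exists, so stepping past each hit by the needle's length cannot skip a real occurrence.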

    — Ken

Re: substrings that consist of repeating characters
by johngg (Canon) on Sep 28, 2020 at 11:26 UTC

    Just in case you need offsets as well, here's a solution for that.

    use strict;
    use warnings;
    use feature qw{ say };

    my $string =
        q{AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT};

    my @matches;
    push @matches, [ length $1, $1, $-[ 0 ] ]
        while $string =~ m{(([ACGT])\2+)}g;

    say qq{Found $_->[ 1 ], length $_->[ 0 ] at offset $_->[ 2 ]}
        for sort {
            $b->[ 0 ] <=> $a->[ 0 ]
            ||
            $a->[ 1 ] cmp $b->[ 1 ]
            ||
            $a->[ 2 ] <=> $b->[ 2 ]
        } @matches;

    The output is sorted by descending length, then by ascending letter, then by ascending offset.

    Found CCCCCC, length 6 at offset 42
    Found GGGG, length 4 at offset 56
    Found AAA, length 3 at offset 0
    Found TTT, length 3 at offset 3
    Found TTT, length 3 at offset 27
    Found TTT, length 3 at offset 62
    Found AA, length 2 at offset 13
    Found AA, length 2 at offset 48
    Found GG, length 2 at offset 15
    Found GG, length 2 at offset 25
    Found TT, length 2 at offset 8
    Found TT, length 2 at offset 11
    Found TT, length 2 at offset 39
    Found TT, length 2 at offset 51
    Found TT, length 2 at offset 54

    I hope this is helpful.

    Cheers,

    JohnGG

Re: substrings that consist of repeating characters (updated x3)
by AnomalousMonk (Bishop) on Sep 27, 2020 at 18:23 UTC

    Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:19:34
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);

    my $string = 'ACGTAAAAATGCCCATGGGGGGG';

    my @repeats = do {
        my $p;
        grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
    };

    dd \@repeats;
    __END__
    ["AAAAA", "CCC", "GGGGGGG"]

    Update 1: But you also want lengths:

    Win8 Strawberry 5.8.9.5 (32) Sun 09/27/2020 14:20:42
    C:\@Work\Perl\monks
    >perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);

    my $string = 'ACGTAAAAATGCCCATGGGGGGG';

    my @repeats_and_lengths = do {
        my $p;
        map [ $_, length ], grep { $p = !$p } $string =~ m{ ((.) \2+) }xmsg;
    };

    dd \@repeats_and_lengths;
    __END__
    [["AAAAA", 5], ["CCC", 3], ["GGGGGGG", 7]]
    You already know how to sort this. :)

    Update 2:

    ... there are statements in the while loop that look doubtful ...
    Other than the /g modifier on the /.../g regex, which I first called useless (oops... not useless!), I don't see anything objectionable. There are usually several ways to do anything, and which is "best" is often a question of taste, unless you're Benchmark-ing.
    ... the idea of using an array to store the substring along with its length might not be good.
    Again, I see nothing to gripe about. It's a matter of taste and the best impedance match to the rest of the code.

    Update 3: Oh, and one more thing... If you're doing a buncha matching operations on a buncha long sequences, it might be useful to add a validation step for each input sequence to be sure it consists only of [ATCG] characters before any further matching operations are done. This allows you to match with . (dot) and know that you can only be matching a valid base character. This might save significant time over many matches, but this can only be determined for sure by benchmarking. (I'd be inclined to add a validation step anyway just to be sure your data really is what you think it is.)
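
    Such a validation step might look something like this. A minimal sketch only: the helper name is_valid_dna and the second sample sequence are made up for illustration, and an empty sequence is treated as invalid by assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: true only if the sequence is non-empty and made up
# entirely of the four base letters.
sub is_valid_dna {
    my ($seq) = @_;
    return $seq =~ /\A[ACGT]+\z/ ? 1 : 0;
}

for my $seq ('ACGTAAAAATGCCCATGGGGGGG', 'ACGTNNACGT') {
    if (is_valid_dna($seq)) {
        # Every character is a base, so . is safe here. In list context the
        # match returns ($1, $2) pairs; keep only the runs (length > 1).
        my @runs = grep { length($_) > 1 } $seq =~ /((.)\2+)/g;
        print "$seq: @runs\n";
    }
    else {
        print "$seq: rejected, contains characters other than A, C, G, T\n";
    }
}
```

    Validating once per sequence keeps the check out of the inner matching loop, which is the point of doing it up front.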


    Give a man a fish:  <%-{-{-{-<

Re: substrings that consist of repeating characters
by LanX (Sage) on Sep 27, 2020 at 20:06 UTC
    The simplest way to do it, demonstrated in the debugger

    DB<39> $_ = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

    DB<40> push @substr, $1 while /((\w)\2+)/g

    DB<41> @sorted = sort { length($b) <=> length($a) } @substr

    DB<42> x @sorted
    0  'CCCCCC'
    1  'GGGG'
    2  'AAA'
    3  'TTT'
    4  'TTT'
    5  'TTT'
    6  'TT'
    7  'TT'
    8  'AA'
    9  'GG'
    10  'GG'
    11  'TT'
    12  'AA'
    13  'TT'
    14  'TT'
    DB<43>

    Storing the length in @substr for a Schwartzian transform might be faster, but I wouldn't bet on this.

    IMHO length is only doing a simple lookup of the pre-calculated length inside Perl's data structure for strings and should be pretty fast.
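
    For reference, the Schwartzian transform mentioned above would look roughly like this. It is a sketch for comparison only and makes no claim about being faster; whether caching the lengths pays off would need benchmarking.

```perl
use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

my @substr;
push @substr, $1 while $string =~ /((\w)\2+)/g;

# Schwartzian transform: compute each length once, sort on the cached
# value, then strip the cache again.
my @sorted =
    map  { $_->[1] }
    sort { $b->[0] <=> $a->[0] }
    map  { [ length($_), $_ ] }
    @substr;

print "@sorted\n";
```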

    HTH! :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    update
    you could also do sort and dump in one line:

    DB<43> print join "\n", sort { length($b)<=>length($a) } @substr
    CCCCCC
    GGGG
    AAA
    TTT
    TTT
    TTT
    TT
    TT
    AA
    GG
    GG
    TT
    AA
    TT
    TT
    DB<44>
      A slight simplification can be gained by using the 'nsort_by' function from List::UtilsBy (or its XS equivalent). You can also use the special variable '$,' rather than 'join' to control the print.
      use strict;
      use warnings;
      use List::UtilsBy::XS qw(nsort_by);

      my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
      my @matches;
      push @matches, $& while ($string=~m/([AGCT])\1+/g);
      local $, = "\n";
      print nsort_by {length} @matches ;
      Bill
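
      If List::UtilsBy is not installed, roughly the same effect can be had with core Perl only. A minimal sketch, assuming an ascending sort by length is what is wanted (as nsort_by {length} gives); it also sets $\ so the last line ends with a newline, which is an addition of this sketch.

```perl
use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

my @matches;
push @matches, $1 while $string =~ m/(([AGCT])\2+)/g;

# Core-only stand-in for nsort_by {length}: ascending numeric sort on length.
my @by_length = sort { length($a) <=> length($b) } @matches;

{
    local $, = "\n";    # output field separator, printed between list items
    local $\ = "\n";    # output record separator, printed after the list
    print @by_length;
}
```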
Re: substrings that consist of repeating characters
by salva (Canon) on Sep 28, 2020 at 20:50 UTC
    A simpler variation of your code:
    use strict;
    use warnings;

    my $string = "AAAAAAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTTTTTTTTTTTTTTTTTATTGGGGACTTT";

    my $len = 0;
    my $best = "";
    while ($string =~ /((.)\2{$len,})/g) {
        $len = length $1;
        $best = $1
    }

    print "best: $best\n"

      At risk of upsetting likbez:

      use strict;
      use warnings;

      my $string = "AAAATTTAGTTCTTAAGGCTGACATCACGTCAGCGTTACCCCCCAAGATTGGGGACTTT";
      my $len = 0;
      my $best = '';

      $best = $1, $len = length $1 while $string =~ /((.)\2{$len,})/g;
      print "best: $best ($len)\n"

      Prints:

      best: CCCCCC (6)
      Optimising for fewest key strokes only makes sense transmitting to Pluto or beyond
      Though, note that that regular expression in my comment above is pretty inefficient as it looks for the longest match at every character instead of skipping chunks of the same character once the match fails at the character starting it (the regular expression in the OP is much better in that regard).

      We can use (*SKIP) to avoid that:

      my $len = 0;
      my $best = "";
      while ($string =~ /((.)(?:(*SKIP)\2){$len,})/g) {
          $len = length $1;
          $best = $1
      }

      print "best: $best\n"

      But that is still not completely efficient: the regexp is recompiled at every loop iteration because of $len, so maybe the following simpler code could be faster:

      my $best = "";
      while ($string =~ /((.)\2+)/g) {
          $best = $1 if length $1 > length $best
      }

      print "best: $best\n"

      Or maybe this more convoluted variation:

      my $best = "";
      $best = $1 while $string =~ /((.)\2*)(*SKIP)(?(?{length $^N <= length $best})(*FAIL))/g;

      print "best: $best\n"

        Does that work?

        Win8 Strawberry 5.30.3.1 (64) Tue 09/29/2020 13:32:10
        C:\@Work\Perl\monks
        >perl
        use strict;
        use warnings;

        my $string = 'AABBBBCCC';

        my $len = 0;
        my $best = "";
        while ($string =~ /((.)(?:(*SKIP)\2){$len,})/g) {
            $len = length $1;
            $best = $1
        }

        print "best: '$best' \n"
        ^Z
        best: ''


        Give a man a fish:  <%-{-{-{-<

Re: substrings that consist of repeating characters
by Tux (Canon) on Sep 29, 2020 at 11:40 UTC

    I was surprised to see the $& being the fastest. At least on my perl-5.28:

    my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
    my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );

    use Test::More;
    use Benchmark qw(cmpthese);

    my %subs;

    sub v1 {
        %subs = ();
        $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
    } # v1

    sub v2 {
        %subs = ();
        $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
    } # v2

    sub v3 {
        %subs = ();
        $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
    } # v3

    sub v4 {
        %subs = ();
        $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
    } # v4

    sub v5 {
        %subs = ();
        $subs{$&}++ while $string =~ m{([ACGT])\1+}g;
    } # v5

    v1 (); is_deeply (\%subs, \%expect, "v1");
    v2 (); is_deeply (\%subs, \%expect, "v2");
    v3 (); is_deeply (\%subs, \%expect, "v3");
    v4 (); is_deeply (\%subs, \%expect, "v4");
    v5 (); is_deeply (\%subs, \%expect, "v5");

    printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for
        sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] }
        map  {[ length, $_ ]}
        keys %subs;

    cmpthese (-2, { v1 => \&v1, v2 => \&v2, v3 => \&v3, v4 => \&v4, v5 => \&v5 });

    done_testing;

    =>

    ok 1 - v1
    ok 2 - v2
    ok 3 - v3
    ok 4 - v4
    ok 5 - v5
        1   6 CCCCCC
        1   4 GGGG
        1   3 AAA
        3   3 TTT
        2   2 AA
        2   2 GG
        5   2 TT
          Rate   v2   v1   v3   v4   v5
    v2 41981/s   -- -30% -52% -56% -57%
    v1 59864/s  43%   -- -31% -38% -39%
    v3 87244/s 108%  46%   --  -9% -12%
    v4 95919/s 128%  60%  10%   --  -3%
    v5 98685/s 135%  65%  13%   3%   --
    1..5

    Enjoy, Have FUN! H.Merijn
      IIRC, once perl sees $& anywhere in the program code, it starts to populate that variable (and $' and $`) for all the regular expression matches in the program. Using it impacts the performance of all the regular expressions in the code, not just those ones where it is actually needed!

        perlvar does mention the issue, but it also says this has been fully fixed since v5.20.

        Edit: so this would mean that you might still get the same relative positions for the different versions on older perls: although $& on its own would be significantly worse than the other solutions, using it anywhere in the benchmark would also lower the performance of all the other versions.
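
        On perls before 5.20, one way to sidestep the $& penalty is the /p modifier together with ${^MATCH}, both available since 5.10 and described in perlvar. A minimal sketch of v5 rewritten that way; the rewrite is an illustration and was not part of the benchmark above, so no timing is claimed for it.

```perl
use strict;
use warnings;

my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
my %subs;

# /p asks perl (5.10+) to preserve the matched text in ${^MATCH} for this
# match only, so $& never has to appear anywhere in the program.
$subs{${^MATCH}}++ while $string =~ m{([ACGT])\1+}gp;

printf "%-6s %d\n", $_, $subs{$_} for sort keys %subs;
```

        From 5.20 on, /p is accepted but no longer needed, so the same code runs unchanged on newer perls.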

      Edit: I thought I had a better version but no. I ran the same benchmark again and the results were not the same at all (actually the three solutions had very similar performances). Something went wrong with my first attempt

      I'm actually consistently getting results that are worse without backreferences, which I don't understand...

        my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";
        my %expect = qw( CCCCCC 1 GGGG 1 AAA 1 TTT 3 AA 2 GG 2 TT 5 );

        my $n = shift // 1;
        if ($n > 1) {
            $string = $string x $n;
            $_ *= $n for values %expect;
        }

        use Test::More;
        use Benchmark qw(cmpthese);

        my %subs;
        my @v = map { "v$_" } 1 .. 8;
        my %f;
        @f{@v} = (
            sub {
                %subs = ();
                $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
            }, # v1
            sub {
                %subs = ();
                $subs{$_}++ for grep m/^([ACGT])\1+$/ => split m/,/ => ($string =~ s/(\w)\K(?!\1)/,/gr);
            }, # v2
            sub {
                %subs = ();
                $subs{$_}++ for $string =~ m/(AA+|CC+|GG+|TT+)/g;
            }, # v3
            sub {
                %subs = ();
                $subs{$1}++ while $string =~ m{(([ACGT])\2+)}g;
            }, # v4
            sub {
                %subs = ();
                $subs{$&}++ while $string =~ m{([ACGT])\1+}g;
            }, # v5
            sub {
                %subs = ();
                $subs{$&}++ while $string =~ m{A{2,}|C{2,}|G{2,}|T{2,}}g;
            }, # v6
            sub {
                %subs = ();
                $subs{$&}++ while $string =~ m{AA+|CC+|GG+|TT+}g;
            }, # v7
            sub {
                %subs = ();
                $subs{$&}++ while $string =~ m{()AA+|CC+|GG+|TT+}g;
            }, # v8
        );

        for (@v) {
            $f{$_}->();
            is_deeply (\%subs, \%expect, $_);
        }

        printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for
            sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] }
            map  {[ length, $_ ]}
            keys %subs;

        cmpthese (-2, { map {( $_ => $f{$_} )} @v });

        done_testing;

        $ test.pl 1
        ok 1 - v1
        ok 2 - v2
        ok 3 - v3
        ok 4 - v4
        ok 5 - v5
        ok 6 - v6
        ok 7 - v7
        ok 8 - v8
            1   6 CCCCCC
            1   4 GGGG
            1   3 AAA
            3   3 TTT
            2   2 AA
            2   2 GG
            5   2 TT
               Rate   v2   v1   v7   v3   v4   v5   v6   v8
        v2  41819/s   -- -30% -45% -53% -57% -58% -60% -63%
        v1  60150/s  44%   -- -21% -32% -38% -40% -43% -47%
        v7  76560/s  83%  27%   -- -13% -22% -23% -28% -32%
        v3  88071/s 111%  46%  15%   -- -10% -12% -17% -22%
        v4  97745/s 134%  63%  28%  11%   --  -2%  -8% -13%
        v5  99555/s 138%  66%  30%  13%   2%   --  -6% -12%
        v6 105700/s 153%  76%  38%  20%   8%   6%   --  -6%
        v8 112783/s 170%  88%  47%  28%  15%  13%   7%   --
        1..8

        $ test.pl 20
        ok 1 - v1
        ok 2 - v2
        ok 3 - v3
        ok 4 - v4
        ok 5 - v5
        ok 6 - v6
        ok 7 - v7
        ok 8 - v8
           20   6 CCCCCC
           20   4 GGGG
           20   3 AAA
           60   3 TTT
           40   2 AA
           40   2 GG
          100   2 TT
             Rate   v2   v1   v7   v3   v4   v5   v6   v8
        v2 2327/s   -- -29% -47% -52% -55% -57% -61% -65%
        v1 3284/s  41%   -- -26% -32% -37% -39% -45% -50%
        v7 4419/s  90%  35%   --  -9% -15% -17% -26% -33%
        v3 4853/s 109%  48%  10%   --  -7%  -9% -18% -27%
        v4 5215/s 124%  59%  18%   7%   --  -3% -12% -21%
        v5 5351/s 130%  63%  21%  10%   3%   -- -10% -19%
        v6 5934/s 155%  81%  34%  22%  14%  11%   -- -10%
        v8 6604/s 184% 101%  49%  36%  27%  23%  11%   --
        1..8

        $ test.pl 2000
        ok 1 - v1
        ok 2 - v2
        ok 3 - v3
        ok 4 - v4
        ok 5 - v5
        ok 6 - v6
        ok 7 - v7
        ok 8 - v8
         2000   6 CCCCCC
         2000   4 GGGG
         2000   3 AAA
         6000   3 TTT
         4000   2 AA
         4000   2 GG
        10000   2 TT
             Rate   v2   v1   v7   v3   v4   v5   v6   v8
        v2 21.3/s   -- -35% -50% -54% -60% -61% -64% -68%
        v1 32.7/s  54%   -- -23% -30% -38% -39% -45% -51%
        v7 42.6/s 100%  30%   --  -9% -19% -21% -28% -36%
        v3 46.6/s 119%  42%   9%   -- -12% -14% -21% -30%
        v4 52.7/s 147%  61%  24%  13%   --  -2% -11% -21%
        v5 54.0/s 154%  65%  27%  16%   3%   --  -9% -19%
        v6 59.2/s 178%  81%  39%  27%  13%  10%   -- -11%
        v8 66.3/s 212% 103%  56%  42%  26%  23%  12%   --
        1..8

        Enjoy, Have FUN! H.Merijn
        Note also that the benchmark results may be different for other input strings. The one in the OP is short and all the same-char substrings are also short, so for instance, results may be different if you use a long string containing long same-char substrings.
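
        One way to try that is to generate a longer input with longer runs and time a couple of the variants on it. A rough sketch under stated assumptions: the seed, the run count (2000) and the maximum run length (50) are arbitrary choices, and the sub names are hypothetical. No timings are quoted because they will vary by machine and perl version.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

srand 42;    # fixed seed so repeated runs use the same input
my @bases = qw(A C G T);

# Hypothetical test input: 2000 runs, each 1 to 50 copies of a random base.
my $long = join '', map { $bases[ rand @bases ] x (1 + int rand 50) } 1 .. 2000;

# Same idea as v4 above: backreference to the first character of the run.
sub count_backref {
    my %subs;
    $subs{$1}++ while $long =~ m{(([ACGT])\2+)}g;
    return \%subs;
}

# Same idea as v3 above: one alternative per base, no backreference.
sub count_alternation {
    my %subs;
    $subs{$_}++ for $long =~ m{(AA+|CC+|GG+|TT+)}g;
    return \%subs;
}

# Sanity check: both variants must find the same runs before timing them.
my ($x, $y) = (count_backref(), count_alternation());
die "variants disagree\n"
    unless join(',', map { $_ . '=' . $x->{$_} } sort keys %$x)
        eq join(',', map { $_ . '=' . $y->{$_} } sort keys %$y);

cmpthese(-1, {
    backref     => \&count_backref,
    alternation => \&count_alternation,
});
```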
        you might want to run the different variants thru use re "debug" to see how they are translated into regex primitives. This might give you a clue what is happening.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: substrings that consist of repeating characters
by Tux (Canon) on Sep 29, 2020 at 11:09 UTC

    Just to add to the confusion, TIMTOWTDI

    use 5.18.0;
    use warnings;

    my $string = "AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT";

    my %subs;
    $subs{$_}++ for grep { length >= 2 } split m/,/ => ($string =~ s/([ACGT])\K(?!\1)/,/gr);
    printf "%5d %3d %s\n", $subs{$_->[1]}, @$_ for
        sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] }
        map  {[ length, $_ ]}
        keys %subs;

        1   6 CCCCCC
        1   4 GGGG
        1   3 AAA
        3   3 TTT
        2   2 AA
        2   2 GG
        5   2 TT

    Enjoy, Have FUN! H.Merijn
Re: substrings that consist of repeating characters
by perlfan (Vicar) on Sep 28, 2020 at 01:37 UTC
    Given your motivating example, you may have already seen BioPerl. If not, its features and innards might be worth studying.
Re: substrings that consist of repeating characters
by vr (Curate) on Sep 29, 2020 at 17:24 UTC

    The task at hand shouts "RLE!!!" at me. General purpose RLE, efficiently (let's hope so) implemented (i.e. coded in C), accessed from Perl -- why, PDL, of course.

    The benchmark below is probably very skewed because my test DNA consists only of short same-base (nucleotide) fragments. Let's assume the ultimate goal is the length of the longest string of C's and its position. The only other contestant is salva's code, modified to fit the stated purpose. Sorry if I missed a faster solution from another monk.

    Note: sneaking Perl's scalar as PDL raw data looks hackish, which it is. Opening scalar as filehandle and then using readflex to stuff PDL raw data is, alas, too slow.

    use strict;
    use warnings;

    use Time::HiRes 'time';
    use Readonly;
    Readonly my $SIZE => 10_000_000;

    my $str;

    { # get us some data
        use String::Random 'random_regex';
        srand 1234;
        $str = random_regex( "[ACTG]{$SIZE}" );
    }

    {
        print "\nlet's test PDL!\n";
        use PDL;
        my $t = time;

        my $p = PDL-> new_from_specification( byte, $SIZE );
        ${ $p-> get_dataref } = $str;
        $p-> upd_data;

        my ( $lengths, $values ) = $p-> rle;
        my $cumu = $lengths-> cumusumover;
        my $C_lengths = $lengths * ( $values == ord 'C' );
        my ( undef, $max, undef, $max_ind ) = $C_lengths-> minmaximum;

        report( $max, $cumu-> at( $max_ind - 1 ), time - $t )
    }

    {
        print "\nlet's test pure Perl's re-engine!\n";
        my $t = time;

        my $best = [ -1, -1 ];
        while ( $str =~ /((C)\2+)/g ) {
            $best = [ length( $1 ), $-[ 1 ]]
                if length $1 > $best-> [ 0 ]
        }

        report( @$best, time - $t )
    }

    sub report {
        printf "\tmax run of C's is %d bases long at %d\n\ttime consumed: %f\n", @_
    }

    __END__

    let's test PDL!
            max run of C's is 11 bases long at 4367281
            time consumed: 0.164513

    let's test pure Perl's re-engine!
            max run of C's is 11 bases long at 4367281
            time consumed: 0.361907
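
    For readers without PDL, the run-length-encoding idea itself can be sketched in plain Perl. This is an illustration of RLE only, not a replacement for the benchmark above: the helper name rle_runs is hypothetical, it uses the short string from the OP rather than the 10M-base test data, and it makes no claim about speed relative to PDL's rle.

```perl
use strict;
use warnings;

# Hypothetical helper: run-length encode $str into [char, length, offset]
# triples with one regex pass.
sub rle_runs {
    my ($str) = @_;
    my @runs;
    while ($str =~ /((.)\2*)/gs) {
        push @runs, [ $2, length($1), $-[0] ];
    }
    return @runs;
}

my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';

# Longest run of C's: filter the triples by character, then sort by length.
my ($best) = sort { $b->[1] <=> $a->[1] }
             grep { $_->[0] eq 'C' } rle_runs($string);

printf "max run of C's is %d bases long at %d\n", $best->[1], $best->[2];
```

    Unlike the /((C)\2+)/ loop above, this encodes every run of every base, including runs of length one, so the same list of triples can answer questions about any base.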
Re: substrings that consist of repeating characters
by wazat (Monk) on Sep 30, 2020 at 23:47 UTC

    While I wouldn't recommend it, I didn't see the following approach:

    my $string = 'AAATTTAGTTCTTAAGGCTGACATCGGTTTACGTCAGCGTTACCCCCCAAGTTATTGGGGACTTT';
    my %max;

    $string =~ s[(.)\1*][if ( length($&) > ($max{$1} // 0) ) {$max{$1} = length($&); }]eg;

    for my $k (sort keys %max) {
        print"$max{$k} ", $k x $max{$k}, "\n";
    }

    output

    3 AAA
    6 CCCCCC
    4 GGGG
    3 TTT
Re: substrings that consist of repeating characters
by Anonymous Monk on Sep 28, 2020 at 05:55 UTC

    Tx for the response. You have given me a lot to think about

Node Type: perlquestion [id://11122267]
Front-paged by Arunbear