
look for substrings and getting their location

by wolffm (Novice)
on May 09, 2004 at 09:22 UTC ( #351821=perlquestion )

wolffm has asked for the wisdom of the Perl Monks concerning the following question:

I have a file that is filled with lines that look like:
the first line is the second line's name, the second line is a DNA sequence
how do I count the number of times a specific string like "GUAUG" occurs in each line
and get the location of each occurrence in the line?

Replies are listed 'Best First'.
Re: look for substrings and getting their location
by PodMaster (Abbot) on May 09, 2004 at 09:47 UTC
      Excellent explanation of three functions that anyone doing biological research in Perl should know. :)

      You should also check out using the BioPerl modules for doing your sequence input and output, it will make your program general enough to work with many different sequence formats, not just FASTA. Here's a quick example:

      use Bio::SeqIO;

      my $filename = 'test.seq';
      my $format   = 'fasta';
      my $seqio    = Bio::SeqIO->new( -file => $filename, -format => $format );

      while ( my $seqobj = $seqio->next_seq() ) {
          my $raw_sequence = $seqobj->seq;
          # do your searching on this raw sequence
      }

      Hope this helps. :)

Re: look for substrings and getting their location
by ozone (Friar) on May 09, 2004 at 11:55 UTC
    A fun one is something like:
    my $string = 'GUAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA';
    my @pieces = split(/GUAUG/, $string, -1);
    print "count [", @pieces - 1, "]\n";
    There you have the count and with a bit of math you can figure out the locations :-D
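The "bit of math" could look something like the sketch below. It is only an illustration: offsets_from_split is a made-up helper name, not something from the post above, and it assumes a literal search string and non-overlapping matches. Each match starts where the preceding pieces and the preceding matches end, so a running total of their lengths gives the zero-based offsets.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): derive match offsets from the
# pieces that split() returns. Assumes a literal, non-overlapping pattern.
sub offsets_from_split {
    my ($string, $pattern) = @_;
    my @pieces = split /\Q$pattern\E/, $string, -1;
    pop @pieces;                      # the last piece follows the final match
    my ($offset, @offsets) = (0);
    for my $piece (@pieces) {
        $offset += length $piece;     # skip the text before this match
        push @offsets, $offset;       # zero-based start of the match
        $offset += length $pattern;   # skip the match itself
    }
    return @offsets;
}

my $string  = 'GUAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA';
my @offsets = offsets_from_split($string, 'GUAUG');
print "count [", scalar @offsets, "] at [@offsets]\n";
```

The -1 limit matters here: without it split drops trailing empty pieces, and a match at the very end of the string would go uncounted.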
Re: look for substrings and getting their location
by CombatSquirrel (Hermit) on May 09, 2004 at 12:08 UTC
    And, in the spirit of TIMTOWTDI, here's one without RegExes:
    #!perl
    use strict;
    use warnings;

    my $seq    = 'GUAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA';
    my $search = 'UUUAA';
    my $max    = length($seq) - length($search);   # last offset where a match can start
    my $i      = 0;
    my @results;

    # <= rather than <, so a match at the very end of $seq is not missed
    while ($i <= $max) {
        $i = index($seq, $search, $i);
        if ($i == -1) {
            last;
        }
        else {
            push @results, $i;
        }
        ++$i;
    }

    print @results . " match(es)\n\n";
    print "    $seq\n";   # four spaces, to line up under the "NN: " prefix below
    for (@results) {
        printf "%02.2d: %s%s\n", $_, ' ' x $_, $search;
    }
    It might be faster or not ;-).

    Hope this helped.
    Entropy is the tendency of everything going to hell.
Re: look for substrings and getting their location
by Not_a_Number (Prior) on May 09, 2004 at 14:25 UTC
    how do I count the number of times a specific string like "GUAUG" occurs in each line

    I have a question:

    For a string in your file that looks like GUAUGUAUGUAUG, how many matches do you want to count?


      all of them + location of the beginning of each one
        I think you missed the point of the follow-up question. Not_a_Number was asking if you want overlapping matches to count too. In other words, we already know you want to count all of the "GUAUG"'s in a string like: "GUAUGGUAUG".

        But do you also want to find two matches in a string with overlapping keywords? Like this: "GUAUGUAUG"... If that's the case, your RE will need zero-width lookahead. Something like this:


        That way the "pointer" is advanced one character at a time rather than one keyword at a time, thus allowing for overlapping matches.

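        You can find the position of each match by calling pos after every successful /g match in scalar context (for example, in a while loop). The special variables @- and @+ hold the start and end offsets of the most recent match and are helpful there too; after a list-context /g match they describe only the last match. See perlvar for a description of them.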
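Here is a minimal sketch of the lookahead idea. It is not the code from the post above; overlapping_offsets is a hypothetical helper name and the keyword is treated as a literal string. Because (?=...) matches without consuming anything, every match is zero-width and pos reports where it starts.

```perl
use strict;
use warnings;

# Hypothetical helper: zero-based starts of all matches, overlaps included.
sub overlapping_offsets {
    my ($seq, $keyword) = @_;
    my @offsets;
    # (?=...) matches without consuming, so the engine advances one
    # character at a time and can find a match that starts inside another.
    while ($seq =~ /(?=\Q$keyword\E)/g) {
        push @offsets, pos($seq);   # zero-width match: pos() is its start
    }
    return @offsets;
}

my @found = overlapping_offsets('GUAUGUAUGUAUG', 'GUAUG');
print scalar(@found), " match(es) at @found\n";
```

For the string from Not_a_Number's question this should report three matches, at offsets 0, 4 and 8; a plain /GUAUG/g would find only the ones at 0 and 8.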

        On the other hand, if you want to find only one match in cases where the sequence appears to overlap, you'll have to define whether you want the left-side of the overlap to match, or the right side.


Re: look for substrings and getting their location
by ambrus (Abbot) on May 09, 2004 at 14:34 UTC

    $count= @{[$string=~ /GUAUG/g]};

    A bit inefficient if there are a lot of matches; using a /g regexp in scalar context would be better.

    Update: The locations too? Then you'll need a /g regexp in scalar context and read $-[0] after each match.
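A sketch of that, assuming a short made-up test string rather than real data: the while loop evaluates the /g match in scalar context, so each iteration finds the next match, and $-[0] gives its zero-based start.

```perl
use strict;
use warnings;

my $string = 'GUAUGGUAUG';   # sample input, chosen for illustration
my @starts;
while ($string =~ /GUAUG/g) {
    push @starts, $-[0];     # $-[0] is the start offset of the last match
}
my $count = @starts;
print "$count match(es) at @starts\n";
```

Note that this counts non-overlapping matches only, since each search resumes after the end of the previous match.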

In Perl 6 that's just...
by TheDamian (Priest) on May 10, 2004 at 20:29 UTC
    $seq ~~ m:overlap/ GUAUG /;
    say "Found ", +@$0, " at:";
    say "\t", $_.pos for @$0;

    So in Perl 5.8 and above you could write:

    use Perl6::Rules;
    use Perl6::Say;

    $seq =~ m:overlap/ GUAUG /;
    say "Found ", scalar(@$0), " at:";
    say "\t", $_->pos for @$0;

    And, as the modifier suggests, it correctly handles overlaps in the data.

      say "\t", $_.pos for @$0;

      Can $_.pos here be written as .pos?

      Juerd # { site => '', plp_site => '', do_not_use => 'spamtrap' }

        It can. I used the longer form because I was trying to keep the Perl 5 version as similar as possible to the Perl 6 version.


Re: look for substrings and getting their location
by Anonymous Monk on May 29, 2004 at 03:14 UTC
    It seems that a number of the posts changed the original question somewhat and consequently did not give full and thorough solutions. For instance, the original question states that the data are in the following format:


    Yet, a couple of the solutions begin by setting

    $var = 'GUAUGUUUAACAGU...'
    How does one get the line name from the solution above? A solution which leaves the data in the original format and gives the line name, number of matches, and their zero-based offsets is as follows:
    #!/usr/bin/perl
    use warnings;
    use strict;

    my $pat = 'GUAUG';
    my ($line, $times, @at);

    while (<DATA>) {
        if (/^[CGUA]+$/) {
            # sequence line: count the matches, then collect their offsets
            $times = () = m/$pat/g;
            if ($times) {
                eval('/^' . ('.*?($pat)' x $times) . '.*?$/; @at = @-;');
                shift @at;   # drop $-[0], the start of the whole match
            }
        }
        else {
            # name line
            ($line) = /^(\w+)$/;
        }
        if ($line and $times) {
            print "$line: $times match", $times>1 ? 'es' : ' ', " at @at\n";
            $line = $times = 0;
        }
    }

    __DATA__
    YBL027W
    GUAUGUUUAACAGUGAUAGUAUGUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGA
    BBL111C
    UAUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAGUAUGGUAUGAAUAUGUUAUGAG
    ABC456T
    AUGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGU
    DEF789U
    UGUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGAGUA
    GHI012V
    GUUUAACAGUGAUACUAAAUUUUGAACCUUUCACAAGAUUUAUCUUUAAAUAUGUUAUGUAUGU
    Perl was created to manipulate text. A solution to a problem such as this should be compact and easy to understand.

    I made a few assumptions:
    • All DNA sequences comprise CGUA. (I thought it was CGAT. I am not a scientist but I play one on TV.)
    • The search strings do NOT overlap.
    • The line name has at least one character that is not C, G, U, or A.
    • All lines alternate between line name and DNA sequence with the former before the latter.
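For comparison, the same assumptions can be sketched without the string eval. This is only an illustration: report_matches is a hypothetical helper, the lines are passed in already chomped, and the sample sequences are shortened from the data above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes alternating name/sequence lines and returns
# one "name: count at offsets" report line per sequence that has a match.
sub report_matches {
    my ($pattern, @lines) = @_;
    my @report;
    while (@lines >= 2) {
        my ($name, $seq) = splice @lines, 0, 2;   # name first, then sequence
        die "unexpected sequence for $name\n" unless $seq =~ /^[CGUA]+$/;
        my @at;
        while ($seq =~ /\Q$pattern\E/g) {         # non-overlapping matches
            push @at, $-[0];                      # zero-based start offset
        }
        push @report, "$name: " . @at . " at @at" if @at;
    }
    return @report;
}

print "$_\n" for report_matches('GUAUG',
    'YBL027W', 'GUAUGUUUAACAGUGAUAGUAUGUUUUGAACC',
    'ABC456T', 'AUGUUUAACAGUGAUACUAAAUUUUGAACC',
);
```

Taking the lines in pairs leans on the last assumption (name before sequence, strictly alternating), and the die makes a violation of the CGUA assumption visible instead of silently skipping the line.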

Node Type: perlquestion [id://351821]
Approved by matija
Front-paged by biosysadmin