Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Bioinformatics: Regex loop, no output

by TamaDP (Initiate)
on Nov 15, 2015 at 17:57 UTC ( [id://1147737]=perlquestion: print w/replies, xml ) Need Help??

TamaDP has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone,

I'm pretty new to Perl, I'm trying to write a code for enzymatic digestion and have a problem with the regex in it; something like this:

So this would be for trypsine, it cuts after K or R, but only if there is no P after them.

#!/usr/bin/env perl use warnings; @proteins=qw(DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEIT +YHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD); for $protein (@proteins) { if ($protein =~ m/(KR)!P/g) { $protein =~ s/(KR)/$1=/; @new_peptides = split ('=',$protein); } } for (@new_peptides) { print "The peptide is $new_peptides\n"; }
I am getting no errors or warnings, but there is no output whatsoever, I'm guessing it has to do with the loop? Need some enlightening please. Many thanks in advance

Thanks all! Working now, and I also added more enzymes, selectable by getopts. Peptides printing alright, however, how could I track which peptide comes from which protein in order to make the printing a little more organised? I.e: Protein 1 Peptide 1 Protein 1 Peptide 2 ... Protein n Pepptide x

Replies are listed 'Best First'.
Re: Bioinformatics: Regex loop, no output
by BrowserUk (Patriarch) on Nov 15, 2015 at 17:42 UTC

    Neither of the sequences in your array contain a '!' character, so you won't get any output.

    Perhaps your regex should be m/(KR)(?!=P)/g ?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Bioinformatics: Regex loop, no output
by GrandFather (Saint) on Nov 16, 2015 at 07:04 UTC

    The following uses look behind (?<=...) with a match set [KR]and negative look ahead (?!P) that rejects a "following P" match in a split to slice up the protein:

    use strict; use warnings; my @proteins = qw( DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD DAAAAATTLTTTAMTTTTTTCK ); for my $protein (@proteins) { my @peptides = split /(?<=[KR])(?!P)/, $protein; next if @peptides < 2; print "Protein: $protein\n"; print "Peptides:\n"; print " $_\n" for @peptides; }

    Prints:

    Protein: DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG Peptides: DAAAAATTLTTTAMTTTTTTCK MMFRPPPPPGGGGGGGGGGGG Protein: ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD Peptides: ALTAMCMNVWEITYHK GSDVNR R ASFAQPPPQPPPPLLAIKPASDASD
    Premature optimization is the root of all job security
      Thanks all! Working now, and I also added more enzymes, selectable by getopts. Peptides printing alright, however, how could I track which peptide comes from which protein in order to make the printing a little more organised? I.e: Protein 1 Peptide 1 Protein 1 Peptide 2 ... Protein n Pepptide x
        In what way does GrandFather's solution (which you replied to) not do what you want?
Re: Bioinformatics: Regex loop, no output
by tonto (Friar) on Nov 15, 2015 at 18:39 UTC

    Maybe like this?

    use strict; use warnings; use Data::Dumper; my @proteins=qw( DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD ); print Dumper \@proteins; my @new_peptides; for my $protein (@proteins) { if ($protein =~ m/[K(?!P)|R(?!P)]/g) { $protein =~ s/K(?!P)|R(?!P)/=/g; push @new_peptides, split ('=',$protein); } } print Dumper \@new_peptides; for (@new_peptides) { print "The peptide is $_\n"; }

    I use Data::Dumper all the time because I've been fooled by my data too many times. ;-)

      The output from your code shows some problems:

      The peptide is DAAAAATTLTTTAMTTTTTTC The peptide is MMFRPPPPPGGGGGGGGGGGG The peptide is ALTAMCMNVWEITYH The peptide is GSDVN The peptide is The peptide is ASFAQPPPQPPPPLLAIKPASDASD
      The K or R terminating split codon (if that's the proper term) is being incorrectly removed from the output peptides. (At least, I think this is incorrect. TamaDP doesn't show desired output, but seems satisfied with output examples given in various replies in this thread that include these codons.) So I assume  GSDVN should really be  GSDVNR and the "null" sequence following it should really be the single-codon sequence R. This is all down to the incorrect definition of the  s/// match pattern; take a look at some other replies in this thread for what I feel are more correct  s/// patterns.

      In an unrelated note, the regex in the condition expression of the
          if ($protein =~ m/[K(?!P)|R(?!P)]/g) { ... }
      block isn't doing what I think you think it's doing. The  [K(?!P)|R(?!P)] character class is exactly equivalent to the  [KPR()?!|] class; metacharacters (alternations, groupings, etc.) have no meaning in a character class, so  ()?!| are just literal characters (and repeated characters have no effect whatsoever). Also, the  /g modifier in the  m//g match is useless in the boolean context of a conditional, although it does no harm (except to burn a few more innocent computrons). Again, all this doesn't affect the basic problem with the code, which stems from the incorrect  s/// match.

      I use Data::Dumper all the time because I've been fooled by my data too many times.

      Yea and amen brother, yea and amen.


      Give a man a fish:  <%-{-{-{-<

        Thank you! I wondered if I understood what was wanted, later posts show that I didn't. I shouldn't have posted that, I'll stop myself next time.

Re: Bioinformatics: Regex loop, no output
by james28909 (Deacon) on Nov 15, 2015 at 21:16 UTC
    I think this line:
    if ($protein =~ m/(KR)!P/g)
    is looking for "KR"... NOT "K" or "R".

    I think what you meant to do was:
    if ($protein =~ m/(K|R)!P/g)

      As BrowserUk pointed out here, there is no  ! regex metacharacter/operator (presumably meaning "not-followed-by-the-following-pattern"). What is probably meant is  (?!pattern) (see Look-Around Assertions in perlre; also see perlretut).

      This may be what TamaDP was looking for (there are some other good solutions):

      c:\@Work\Perl\monks>perl -wMstrict -le "my @proteins = qw( DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD AAAKAAA AAAKAAA XXXXXX ); ;; my @new_peptides; for my $protein (@proteins) { if ($protein =~ s{ (?<= [KR]) (?! P) }{=}xmsg) { push @new_peptides, split ('=', $protein); } } ;; for my $peptide (@new_peptides) { print qq{Peptide is '$peptide'}; } " Peptide is 'DAAAAATTLTTTAMTTTTTTCK' Peptide is 'MMFRPPPPPGGGGGGGGGGGG' Peptide is 'ALTAMCMNVWEITYHK' Peptide is 'GSDVNR' Peptide is 'R' Peptide is 'ASFAQPPPQPPPPLLAIKPASDASD' Peptide is 'AAAK' Peptide is 'AAA' Peptide is 'AAAK' Peptide is 'AAA'
      Un-s///-ubstituted input proteins are not split and push-ed into the output array.

      Yet another solution might be:

      c:\@Work\Perl\monks>perl -wMstrict -le "my @proteins = qw( DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD AAAKAAA AAAKAAA XXXXXX ); ;; my $cleave = qr{ (?<= [KR]) (?! P) }xms; ;; my @peptides = map split($cleave), grep m{ $cleave }xms, @proteins ; ;; print qq{Peptide is '$_'} for @peptides; " Peptide is 'DAAAAATTLTTTAMTTTTTTCK' Peptide is 'MMFRPPPPPGGGGGGGGGGGG' Peptide is 'ALTAMCMNVWEITYHK' Peptide is 'GSDVNR' Peptide is 'R' Peptide is 'ASFAQPPPQPPPPLLAIKPASDASD' Peptide is 'AAAK' Peptide is 'AAA' Peptide is 'AAAK' Peptide is 'AAA'
      in which the central process might be documented (with the steps of the statement being taken right-to-left, bottom-to-top) as:
      my @peptides = # 4. and the pieces are peptides. map split($cleave), # 3. split at each cleavage point... grep m{ $cleave }xms, # 2. that can be cleaved, ... @proteins # 1. for each protein... ;

      Update: After sober consideration, I changed the code example above to its present form. The previous code contained
          my $cleave = qr{ [KR] (?! P) }xms;
          ...
            map  split(m{ (?<= $cleave) }xms),
            grep m{ $cleave }xms,
            ...
      which doesn't really respect the DRY principle, which is what I was aiming to exemplify.


      Give a man a fish:  <%-{-{-{-<

Re: Bioinformatics: Regex loop, no output
by Anonymous Monk on Nov 15, 2015 at 17:48 UTC
    #!/usr/bin/perl # http://perlmonks.org/?node_id=1147731 use strict; use warnings; my @new_peptides; my @proteins=qw( DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD ); for my $protein (@proteins) { if ($protein =~ m/[KR](?!P)/) { $protein =~ s/[KR]\K(?!P)/=/g; push @new_peptides, split ('=',$protein); } } for (@new_peptides) { print "The peptide is $_\n"; }
Re: Bioinformatics: Regex loop, no output
by Laurent_R (Canon) on Nov 15, 2015 at 21:32 UTC
    If you don't need to keep the K or R items on which to cut the string, you could try this (demonstrated under the Perl debugger):
    DB<6> $p = 'ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD'; DB<7> @b = split /[KR](?!P)/, $p; DB<8> x \@b; 0 ARRAY(0x6004f0f88) 0 'ALTAMCMNVWEITYH' 1 'GSDVN' 2 '' 3 'ASFAQPPPQPPPPLLAIKPASDASD'
    If you need to keep those K and R items, then this might be easier:
    DB<10> p $p ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD DB<11> $p = 'ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD'; DB<12> $p =~ s/([KR])(?!P)/$1=/g; DB<21> p $p ALTAMCMNVWEITYHK=GSDVNR=R=ASFAQPPPQPPPPLLAIKPASDASD
    In the latter case, you can then just split the resulting string on /=/.
Re: Bioinformatics: Regex loop, no output
by Anonymous Monk on Nov 15, 2015 at 18:29 UTC

    Using the power of map:

    #! /usr/bin/perl -wl my @proteins = qw(DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMN +VWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD); my %seen = map {$_ => 1} @proteins; print "Peptide $_" for grep !$seen{$_}++, map {split /[KR]\K(?!P)/} @proteins;

      Something I don't understand here. The "uniquifying" action of the  %seen hash acts to prevent the passage of un-split, whole, original proteins (like  AAAAAA in the example below) from getting through the dataflow "pipe" into the output, but it also prevents duplicated split pieces (e.g.,  AAAAK AAAA below) from the input from being output. Is this bioinformatically useful?

      c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le "my @proteins = qw(AAAAKAAAA AAAAKAAAA AAAAAA); ;; my %seen = map {$_ => 1} @proteins; ;; print qq{Peptide '$_'} for grep !$seen{$_}++, map {split /[KR]\K(?!P)/} @proteins; ;; dd \%seen; " Peptide 'AAAAK' Peptide 'AAAA' { AAAA => 2, AAAAAA => 2, AAAAK => 2, AAAAKAAAA => 1 }


      Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1147737]
Approved by NetWallah
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others studying the Monastery: (3)
As of 2024-09-08 06:53 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuli‥ 🛈The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.