TamaDP has asked for the wisdom of the Perl Monks concerning the following question:
Hi everyone,
I'm pretty new to Perl, I'm trying to write a code for enzymatic digestion and have a problem with the regex in it; something like this:
So this would be for trypsine, it cuts after K or R, but only if there is no P after them.
#!/usr/bin/env perl
use warnings;
@proteins=qw(DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMNVWEIT
+YHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD);
for $protein (@proteins) {
if ($protein =~ m/(KR)!P/g) {
$protein =~ s/(KR)/$1=/;
@new_peptides = split ('=',$protein);
}
}
for (@new_peptides) {
print "The peptide is $new_peptides\n";
}
I am getting no errors or warnings, but there is no output whatsoever, I'm guessing it has to do with the loop? Need some enlightening please.
Many thanks in advance
Thanks all! Working now, and I also added more enzymes, selectable by getopts. Peptides printing alright, however, how could I track which peptide comes from which protein in order to make the printing a little more organised? I.e: Protein 1 Peptide 1 Protein 1 Peptide 2 ... Protein n Pepptide x
Re: Bioinformatics: Regex loop, no output
by BrowserUk (Patriarch) on Nov 15, 2015 at 17:42 UTC
|
Neither of the sequences in your array contain a '!' character, so you won't get any output.
Perhaps your regex should be m/(KR)(?!=P)/g ?
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] |
Re: Bioinformatics: Regex loop, no output
by GrandFather (Saint) on Nov 16, 2015 at 07:04 UTC
|
The following uses look behind (?<=...) with a match set [KR]and negative look ahead (?!P) that rejects a "following P" match in a split to slice up the protein:
use strict;
use warnings;
my @proteins = qw(
DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
DAAAAATTLTTTAMTTTTTTCK
);
for my $protein (@proteins) {
my @peptides = split /(?<=[KR])(?!P)/, $protein;
next if @peptides < 2;
print "Protein: $protein\n";
print "Peptides:\n";
print " $_\n" for @peptides;
}
Prints:
Protein: DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
Peptides:
DAAAAATTLTTTAMTTTTTTCK
MMFRPPPPPGGGGGGGGGGGG
Protein: ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
Peptides:
ALTAMCMNVWEITYHK
GSDVNR
R
ASFAQPPPQPPPPLLAIKPASDASD
Premature optimization is the root of all job security
| [reply] [d/l] [select] |
|
Thanks all! Working now, and I also added more enzymes, selectable by getopts.
Peptides printing alright, however, how could I track which peptide comes from which protein in order to make the printing a little more organised?
I.e:
Protein 1 Peptide 1
Protein 1 Peptide 2 ...
Protein n Pepptide x
| [reply] |
|
In what way does GrandFather's solution (which you replied to) not do what you want?
| [reply] |
|
|
|
Re: Bioinformatics: Regex loop, no output
by tonto (Friar) on Nov 15, 2015 at 18:39 UTC
|
use strict;
use warnings;
use Data::Dumper;
my @proteins=qw(
DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
);
print Dumper \@proteins;
my @new_peptides;
for my $protein (@proteins) {
if ($protein =~ m/[K(?!P)|R(?!P)]/g) {
$protein =~ s/K(?!P)|R(?!P)/=/g;
push @new_peptides, split ('=',$protein);
}
}
print Dumper \@new_peptides;
for (@new_peptides) {
print "The peptide is $_\n";
}
I use Data::Dumper all the time because I've been fooled by my data too many times. ;-) | [reply] [d/l] |
|
The peptide is DAAAAATTLTTTAMTTTTTTC
The peptide is MMFRPPPPPGGGGGGGGGGGG
The peptide is ALTAMCMNVWEITYH
The peptide is GSDVN
The peptide is
The peptide is ASFAQPPPQPPPPLLAIKPASDASD
The K or R terminating split codon (if that's the proper term) is being incorrectly removed from the output peptides. (At least, I think this is incorrect. TamaDP doesn't show desired output, but seems satisfied with output examples given in various replies in this thread that include these codons.) So I assume GSDVN should really be GSDVNR and the "null" sequence following it should really be the single-codon sequence R. This is all down to the incorrect definition of the s/// match pattern; take a look at some other replies in this thread for what I feel are more correct s/// patterns.
In an unrelated note, the regex in the condition expression of the
if ($protein =~ m/[K(?!P)|R(?!P)]/g) { ... }
block isn't doing what I think you think it's doing. The [K(?!P)|R(?!P)] character class is exactly equivalent to the [KPR()?!|] class; metacharacters (alternations, groupings, etc.) have no meaning in a character class, so ()?!| are just literal characters (and repeated characters have no effect whatsoever). Also, the /g modifier in the m//g match is useless in the boolean context of a conditional, although it does no harm (except to burn a few more innocent computrons). Again, all this doesn't affect the basic problem with the code, which stems from the incorrect s/// match.
I use Data::Dumper all the time because I've been fooled by my data too many times.
Yea and amen brother, yea and amen.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
| [reply] |
Re: Bioinformatics: Regex loop, no output
by james28909 (Deacon) on Nov 15, 2015 at 21:16 UTC
|
if ($protein =~ m/(KR)!P/g)
is looking for "KR"... NOT "K" or "R".
I think what you meant to do was:
if ($protein =~ m/(K|R)!P/g)
| [reply] [d/l] [select] |
|
As BrowserUk pointed out here, there is no ! regex metacharacter/operator (presumably meaning "not-followed-by-the-following-pattern"). What is probably meant is (?!pattern) (see Look-Around Assertions in perlre; also see perlretut).
This may be what TamaDP was looking for (there are some other good solutions):
c:\@Work\Perl\monks>perl -wMstrict -le
"my @proteins = qw(
DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
AAAKAAA AAAKAAA XXXXXX
);
;;
my @new_peptides;
for my $protein (@proteins) {
if ($protein =~ s{ (?<= [KR]) (?! P) }{=}xmsg) {
push @new_peptides, split ('=', $protein);
}
}
;;
for my $peptide (@new_peptides) {
print qq{Peptide is '$peptide'};
}
"
Peptide is 'DAAAAATTLTTTAMTTTTTTCK'
Peptide is 'MMFRPPPPPGGGGGGGGGGGG'
Peptide is 'ALTAMCMNVWEITYHK'
Peptide is 'GSDVNR'
Peptide is 'R'
Peptide is 'ASFAQPPPQPPPPLLAIKPASDASD'
Peptide is 'AAAK'
Peptide is 'AAA'
Peptide is 'AAAK'
Peptide is 'AAA'
Un-s///-ubstituted input proteins are not split and push-ed into the output array.
Yet another solution might be:
c:\@Work\Perl\monks>perl -wMstrict -le
"my @proteins = qw(
DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
AAAKAAA AAAKAAA XXXXXX
);
;;
my $cleave = qr{ (?<= [KR]) (?! P) }xms;
;;
my @peptides =
map split($cleave),
grep m{ $cleave }xms,
@proteins
;
;;
print qq{Peptide is '$_'} for @peptides;
"
Peptide is 'DAAAAATTLTTTAMTTTTTTCK'
Peptide is 'MMFRPPPPPGGGGGGGGGGGG'
Peptide is 'ALTAMCMNVWEITYHK'
Peptide is 'GSDVNR'
Peptide is 'R'
Peptide is 'ASFAQPPPQPPPPLLAIKPASDASD'
Peptide is 'AAAK'
Peptide is 'AAA'
Peptide is 'AAAK'
Peptide is 'AAA'
in which the central process might be documented (with the steps of the statement being taken right-to-left, bottom-to-top) as:
my @peptides = # 4. and the pieces are peptides.
map split($cleave), # 3. split at each cleavage point...
grep m{ $cleave }xms, # 2. that can be cleaved, ...
@proteins # 1. for each protein...
;
Update: After sober consideration, I changed the code example above to its present form. The previous code contained
my $cleave = qr{ [KR] (?! P) }xms;
...
map split(m{ (?<= $cleave) }xms),
grep m{ $cleave }xms,
...
which doesn't really respect the DRY principle, which is what I was aiming to exemplify.
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
Re: Bioinformatics: Regex loop, no output
by Anonymous Monk on Nov 15, 2015 at 17:48 UTC
|
#!/usr/bin/perl
# http://perlmonks.org/?node_id=1147731
use strict;
use warnings;
my @new_peptides;
my @proteins=qw(
DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
);
for my $protein (@proteins) {
if ($protein =~ m/[KR](?!P)/) {
$protein =~ s/[KR]\K(?!P)/=/g;
push @new_peptides, split ('=',$protein);
}
}
for (@new_peptides) {
print "The peptide is $_\n";
}
| [reply] [d/l] |
Re: Bioinformatics: Regex loop, no output
by Laurent_R (Canon) on Nov 15, 2015 at 21:32 UTC
|
If you don't need to keep the K or R items on which to cut the string, you could try this (demonstrated under the Perl debugger):
DB<6> $p = 'ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD';
DB<7> @b = split /[KR](?!P)/, $p;
DB<8> x \@b;
0 ARRAY(0x6004f0f88)
0 'ALTAMCMNVWEITYH'
1 'GSDVN'
2 ''
3 'ASFAQPPPQPPPPLLAIKPASDASD'
If you need to keep those K and R items, then this might be easier:
DB<10> p $p
ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD
DB<11> $p = 'ALTAMCMNVWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD';
DB<12> $p =~ s/([KR])(?!P)/$1=/g;
DB<21> p $p
ALTAMCMNVWEITYHK=GSDVNR=R=ASFAQPPPQPPPPLLAIKPASDASD
In the latter case, you can then just split the resulting string on /=/.
| [reply] [d/l] [select] |
Re: Bioinformatics: Regex loop, no output
by Anonymous Monk on Nov 15, 2015 at 18:29 UTC
|
#! /usr/bin/perl -wl
my @proteins = qw(DAAAAATTLTTTAMTTTTTTCKMMFRPPPPPGGGGGGGGGGGG ALTAMCMN
+VWEITYHKGSDVNRRASFAQPPPQPPPPLLAIKPASDASD);
my %seen = map {$_ => 1} @proteins;
print "Peptide $_" for grep !$seen{$_}++,
map {split /[KR]\K(?!P)/} @proteins;
| [reply] [d/l] |
|
Something I don't understand here. The "uniquifying" action of the %seen hash acts to prevent the passage of un-split, whole, original proteins (like AAAAAA in the example below) from getting through the dataflow "pipe" into the output, but it also prevents duplicated split pieces (e.g., AAAAK AAAA below) from the input from being output. Is this bioinformatically useful?
c:\@Work\Perl\monks>perl -wMstrict -MData::Dump -le
"my @proteins = qw(AAAAKAAAA AAAAKAAAA AAAAAA);
;;
my %seen = map {$_ => 1} @proteins;
;;
print qq{Peptide '$_'}
for grep !$seen{$_}++, map {split /[KR]\K(?!P)/} @proteins;
;;
dd \%seen;
"
Peptide 'AAAAK'
Peptide 'AAAA'
{ AAAA => 2, AAAAAA => 2, AAAAK => 2, AAAAKAAAA => 1 }
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |
|
|