Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

program to look for specific K-mer sequence

by pearllearner315 (Acolyte)
on Apr 20, 2017 at 16:46 UTC ( [id://1188439]=perlquestion: print w/replies, xml ) Need Help??

pearllearner315 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks, The same question was asked here:

http://stackoverflow.com/questions/40414018/perl-program-to-look-for-k-mer-with-specific-sequence

but seems like it wasn't answered correctly or wasn't finished.

I need to build upon a previous program that I've written and find the first 1000 unique, k-mers of length 23 that end with GG from a Fasta file.

The first 20 lines of the Fasta file look like this:
> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ: +NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=2 +3011544; release=r5.54; species=Dmel; CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT
And the expected outcome should be in this format:
>crispr_1 GGGTGGAGCTCCCGAAATGCAGG >crispr_2 TTAATAAATATTGACACAGCGGG >crispr_3 ATCGTGGGGCGTTTTGTGAAAGG >crispr_4 AGTTTTTCACATAATCAGACAGG >crispr_5 GTGTTGGATGAGTGTCCTCTGGG >crispr_6 ATAGGTTGGTTGTTTTAAAAGGG >crispr_7 AAATTTTTGTTGCCACTGAATGG >crispr_8 AAGTTTCGAACTACGATGGTTGG >crispr_9 CATGCTTTGTGGAAATAAGTCGG >crispr_10 CACAGTGGGTGTTTGCACCTCGG .....and so on
It should be the first 1000 unique sequences.

This is what I have so far:
#!/usr/bin/perl use warnings; use strict; my %windowSeqCount = (); my $sequenceRef = loadSequence("input.fasta"); #writing to a new file open (UNIQUEKMERS,">",output.fasta') or die $!; my $windowSize = 23; my $stepSize = 1; for( my $windowStart = 0 ; $windowStart <= ( length($$sequenceRef) - $windowSize ); $windowStart += $stepSize ) { my $windowSeq = substr ( $$sequenceRef, $windowStart, $windowSize); $windowSeqCount{$windowSeq}++ if $windowSeq =~ /GG$/; } for (keys %windowSeqCount){ print UNIQUEKMERS $_, "\t", $windowSeqCount{$_}, "\n"; } sub loadSequence { my ($sequenceFile) = @_; my $sequence = ""; unless ( open( FASTA, "<", $sequenceFile ) ) { die $!; } while (<FASTA>){ my $line = $_; chomp ($line); if ($line !~ /^>/ ) { $sequence .= $line; } } return \$sequence; }
I know my current print section is incorrect and that it will find 23 length sequences ending with GG and store them in a hash with the number of occurences as values.
I can't seem to figure what I need to include in my code to find just the first 1000 unique sequences of length 23 ending in GG in the expected format.
Any help would be greatly appreciated!! Thank you!!!

Replies are listed 'Best First'.
Re: program to look for specific K-mer sequence
by AnomalousMonk (Archbishop) on Apr 20, 2017 at 17:53 UTC

    In addition to what Eily wrote, please bear in mind that this is PerlMonks and not BioMonks: what's a k-mer?

    A consultation of Wikipedia turned up this. IIUC, the 25-base sequence  GGGGGGGGGGGGGGGGGGGGGGGGG has three overlapping 23-base k-mers ending in GG: the 23-base substrings starting at offsets 0, 1 and 2.

    The cannonical regex approach to extracting overlapping patterns is this:

    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le "my $seq = 'CCCAAAAAAAAAAAAAAAAAAAAAGGTTGGCCGGAAA'; my @k_mers = $seq =~ m{ (?= (.{21} GG)) }xmsg; print qq{'$_'} for @k_mers; " 'AAAAAAAAAAAAAAAAAAAAAGG' 'AAAAAAAAAAAAAAAAAGGTTGG' 'AAAAAAAAAAAAAGGTTGGCCGG'
    You want only unique k-mers and only the first thousand. One approach (assuming you have already extracted the entire, contiguous base sequence from each FASTA record):
    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le "my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGC +CGGAAA'; ;; my $rx_kmer = qr{ .{21} GG }xms; ;; my @k_mers; my %seen; ;; KMER: while ($seq =~ m{ (?= ($rx_kmer)) }xmsg) { push @k_mers, $1 unless $seen{$1}++; last KMER if @k_mers >= 1000; } ;; print qq{'$_'} for @k_mers; " 'GGGGGGGGGGGGGGGGGGGGGGG' 'AAAAAAAAAAAAAAAAAAAAAGG' 'AAAAAAAAAAAAAAAAAGGTTGG' 'AAAAAAAAAAAAAGGTTGGCCGG'

    Update 1: Eliminated entirely unnecessary  $count variable from final code example.

    Update 2: Here's a slightly more elegant approach if you're not going to be extracting millions of k-mers (see List::MoreUtils::uniq):

    c:\@Work\Perl\monks\pearllearner315>perl -wMstrict -le "use List::MoreUtils qw(uniq); ;; my $seq = 'GGGGGGGGGGGGGGGGGGGGGGGGGGGCCCAAAAAAAAAAAAAAAAAAAAAGGTTGGC +CGGAAA'; ;; my $rx_kmer = qr{ .{21} GG }xms; ;; my @k_mers = uniq $seq =~ m{ (?= ($rx_kmer)) }xmsg; $#k_mers = 999 if $#k_mers >= 1000; ;; print qq{'$_'} for @k_mers; " 'GGGGGGGGGGGGGGGGGGGGGGG' 'AAAAAAAAAAAAAAAAAAAAAGG' 'AAAAAAAAAAAAAAAAAGGTTGG' 'AAAAAAAAAAAAAGGTTGGCCGG'

    Update 3: If you also need the offset of each extracted k-mer subsequence, there are ways to do that, too.


    Give a man a fish:  <%-{-{-{-<

Re: program to look for specific K-mer sequence
by Eily (Monsignor) on Apr 20, 2017 at 17:12 UTC

    Hello Perl Monks, The same question was asked here:
    http://stackoverflow.com/questions/40414018/perl-program-to-look-for-k-mer-with-specific-sequence
    but seems like it wasn't answered correctly or wasn't finished.
    Well thanks for providing that link :). What do you mean by "seems like it wasn't answered correctly" though? You're not sure about whether the answer is there or not?

    It would be helpful if you actually provided the real expected output from the input you give, not "a hint", so that you could tell us exactly how the output you do obtain differs.

    If I understand you correctly, you have two issues: only display unique sequences, and limit to the first 1000. There is more than one way to do it, because this is perl, but one way you could do that is store both the number of occurences and the position for each sequence:

    if ($windowSeq =~ /GG$/) { $sequences{$windowSeq}{Count}++; $sequences{$windowSeq}{Position} = $windowStart; }
    You can then filter the sequences that have a Count equal to 1 using grep: my @uniq = grep { $sequences{$_}{Count} == 1 } keys %sequences;
    and then sort by position: my @ordered = sort { $sequences{$a}{Position} <=> $sequences{$b}{Position} } @uniq;

    Taking the first 1000 sequences from that array is left as an exercise to the reader ;-).

    Edit: I don't know if you want to find overlapping sequences (like in AAAGGG, you either have AAAGG or AAGGG as 5 letter sequences ending in GG), but if not, the logic should be :

    if ($windowSeq =~ /GG$/) { $windowStart += $windowSize; } else { $windowStart += $stepSize; }

Re: program to look for specific K-mer sequence
by johngg (Canon) on Apr 20, 2017 at 22:15 UTC

    An approach that uses substr, map and grep rather than regexen.

    use strict; use warnings; open my $inFH, q{<}, \ <<EOF or die $!; > >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ: +NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=2 +3011544; release=r5.54; species=Dmel; CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT EOF do { my $discard = <$inFH> }; my $seq = join q{}, map { chomp; $_ } <$inFH>; close $inFH or die $!; my $ct = 0; my @kmersGG = do { my %seen; grep { not $seen{ $_ } ++ } grep { q{GG} eq substr $_, 21 } map { substr $seq, $_, 23 } 0 .. length( $seq ) - 23; }; printf qq{>crispr_%d\n%s\n}, ++ $ct, $_ for @kmersGG > 1000 ? @kmersGG[ 0 .. 999 ] : @kmersGG;

    The output.

    I hope this is of interest.

    Cheers,

    JohnGG

Re: program to look for specific K-mer sequence
by tybalt89 (Monsignor) on Apr 21, 2017 at 02:49 UTC

    It would be helpful to have several test cases with actual input and actual matching output for each case.

    #!/usr/bin/perl # http://perlmonks.org/?node_id=1188439 use strict; use warnings; $_ = join('', grep /^[\sACGT]+$/, <DATA>) =~ tr/ACGT//cdr; my $count = 0; while( /(?=(.{21}GG))/g and $count < 1000 ) { /(?=$1).+$1/ or print ">crispr_@{[++$count]}\n$1\n"; } __DATA__ > >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ: +NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=2 +3011544; release=r5.54; species=Dmel; CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1188439]
Approved by herveus
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (4)
As of 2024-04-25 20:49 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found