Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Hello,

I used BioPerl, Bio::SeqIO to load the fasta sequences all into a long string. Then I tested for kmers that only appeared once and printed them out.

The results agree with your desired output.

#!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my $in = Bio::SeqIO->new( -file => "dmel-all-chromosome-r6.18.fasta +" , -format => 'fasta'); my $fasta; while ( my $seq = $in->next_seq() ) { $fasta .= $seq->seq; } my %seen; for my $i (0 .. length($fasta) - 21) { my $kmer = substr $fasta, $i, 21; next unless substr($kmer, -2) eq 'GG'; my $match = substr($kmer, -12); $seen{$match}{count}++; $seen{$match}{kmer} = $kmer; } my $crispr; for my $key (keys %seen) { next unless $seen{$key}{count} == 1; print "crispr_", ++$crispr, "\n"; print $seen{$key}{kmer}, "\n"; } __END__ *** ouput crispr_1 TTTAGACTCCCCTTGTACAGG crispr_2 TCTTCAGTCTCCAGTCTCCGG crispr_3 TTGCGTTGCGGAGCATACTGG crispr_4 TGCCACCAGTGGTTCCAAGGG crispr_5 TTATGTTTGTACGAGGGGGGG crispr_6 TCTCTTTGGTTTACGGATGGG crispr_7 TTGGCAAGGAGACGGTCCTGG crispr_8 TGAATTAAAGCTTGCGCGAGG crispr_9 GGAAGAGGCATCAACGAGGGG crispr_10 TGCAGCGGCCTAACAAGGCGG crispr_11 CTGCCCGATCCTAACTCCAGG crispr_12 ATATATGTTTGACCGTCGGGG ... crispr_126892 TTGCTTGGCACTAAGGCGGGG crispr_126893 CACCAAAAAGGACTTGCGTGG crispr_126894 GTGCCCCTCACTCATGCGGGG crispr_126895 TAAAAAGCGACGCAGTATTGG crispr_126896 CCGGATTTCTTCGTACAGGGG crispr_126897 GGTGGCTATGCTATGGTACGG crispr_126898 CTGCGTTGATGTTAGGTAGGG crispr_126899 GCTGGGACCCGAATACGTAGG

This program runs in under 2 minutes with the string of fasta characters about 145M. From your specs it looks like you want to combine the sequences into one string to test for kmers that only occur 1 time.

Update: Just realized how odd it was for my results to agree with the ones you posted. The order of my results were in the (un)ordered keys from the hash. It is curious why the order I got agreed with the order in your post!

Also, I think the reason you were getting 1 million results rather than 250,000 is you are testing for uniqueness on the whole 21 char windowsize instead of testing just the last 12 chars (including the 'GG' ending). There would be more unique 21 char kmers than unique 12 char kmers.


In reply to Re: unique sequences by Cristoforo
in thread unique sequences by Anonymous Monk

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2024-04-23 17:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found