Re: renaming 1000's of FASTA files
by moritz (Cardinal) on Jul 11, 2011 at 11:43 UTC
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
use Data::Dumper;

my %seq_id;
open HEADER, "<FASTA.headers" or die $!;
while (<HEADER>) {
    chomp $_;
    my $fasta_id = $_;
    $fasta_id =~ s/_.*//g;
    $seq_id{$fasta_id} => $_;
}

my $infile = $ARGV[0] || die("Please give me an input fasta file\n");
my $inseq  = new Bio::SeqIO(-format => 'fasta',
                            -file   => $infile);
while (my $seq_obj = $inseq->next_seq) {
    my $id = $seq_obj->id;
    chomp $id;
    my $seq = $seq_obj->seq;
    if (exists($seq_id{$id})) {
        print ">";
        print $seq_id{$fasta_id};
        print "\n" . $seq . "\n";
    }
}
if (exists ($seq_id{$id})) {
    print ">";
    print $seq_id{$fasta_id};
    print "\n".$seq."\n";
}
should be
if (exists ($seq_id{$id})) {
    print ">";
    print $seq_id{$id};
    print "\n".$seq."\n";
}
and
$seq_id{$fasta_id} => $_;
should be
$seq_id{$fasta_id} = $_;
Update: Corrected print $seq_id{$id}; from print $id.
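Putting the two fixes together, a minimal self-contained sketch might look like this (the header list is inlined here instead of being read from FASTA.headers, and the Bio::SeqIO part is reduced to a single hard-coded lookup, purely for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %seq_id;
# In the real script these lines come from the FASTA.headers file.
my @headers = ("F0Z7V0F01A03EB_210", "F0Z7V0F01AXYZ9_180");
for my $line (@headers) {
    chomp $line;
    my $fasta_id = $line;
    $fasta_id =~ s/_.*//;          # strip the _NNN suffix
    $seq_id{$fasta_id} = $line;    # '=' (assignment), not '=>'
}

# Later, inside the Bio::SeqIO loop, look up by the sequence's own id:
my $id = "F0Z7V0F01A03EB";
if (exists $seq_id{$id}) {
    print ">", $seq_id{$id}, "\n";   # $seq_id{$id}, not $seq_id{$fasta_id}
}
```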
Re: renaming 1000's of FASTA files
by Anonymous Monk on Jul 11, 2011 at 12:05 UTC
If the total data set will sensibly fit in memory all at once, then you can use a hash, as you are doing now.
Otherwise, sort both files identically and write code that walks through the two sorted files in step, comparing them.
Or ... put the data into an SQLite database file, which is a flat file requiring no server at all, and use queries.
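The "sort both files, then compare" idea is a classic merge walk. A minimal sketch, with two in-memory arrays standing in for the two sorted files and made-up ids:

```perl
use strict;
use warnings;

# Two already-sorted id lists (in the real case, read line by line
# from the two sorted files instead of from arrays).
my @a = sort qw(id1 id3 id5);
my @b = sort qw(id2 id3 id5);

my @common;
my ($i, $j) = (0, 0);
while ($i < @a && $j < @b) {
    if    ($a[$i] lt $b[$j]) { $i++ }              # only in the first list
    elsif ($a[$i] gt $b[$j]) { $j++ }              # only in the second list
    else  { push @common, $a[$i]; $i++; $j++ }     # present in both
}
# @common now holds the ids found in both lists: id3, id5
```

Because each list is consumed front to back exactly once, memory use stays constant no matter how large the files are.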
Re: renaming 1000's of FASTA files
by Cristoforo (Curate) on Jul 11, 2011 at 16:53 UTC
"The script puts all the lines from the .txt file in a hash"
It doesn't: it just keeps creating a new one-element hash,
%seq_id = ("$fasta_id" => "$_");
You would need to say instead
$seq_id{ $fasta_id } = $_;
However, moritz has shown the better solution (instead of reopening the file over and over).
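A tiny demonstration of the difference (sample keys and values are made up):

```perl
use strict;
use warnings;

my %h;
for my $pair (["a", 1], ["b", 2]) {
    # Wrong: rebuilds the whole hash each pass, so only the last
    # pair survives the loop:
    #   %h = ($pair->[0] => $pair->[1]);
    # Right: adds one element per pass, keeping the earlier ones:
    $h{ $pair->[0] } = $pair->[1];
}
# %h now holds both keys: a => 1, b => 2
```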
Re: renaming 1000's of FASTA files
by sundialsvc4 (Abbot) on Jul 11, 2011 at 20:30 UTC
I agree that SQLite is quite probably “the thing to do” here. I would put each of the sequences into a table, along with (in a separate column) whatever key value you happen to be looking for ... whatever (I’m not a biologist...) “identifies” the sequence, whether or not it is unique.
If there are many keys that might identify a particular sequence, “okay, big deal, you have a many-to-one relationship.”
Anyhow ... SQLite gives you a quite-robust SQL implementation, in a single file, without a server. (Really, the only “gotcha” it has is transactions: when you are changing data, you need to be inside a transaction while doing so, or you’ll really regret it speed-wise.)
Then build non-unique indexes on the key columns.
Once you have the data in this database form, I suspect that a lot of your “difficult processing” will be reduced to SQL queries that might not even require a Perl program to run.
SQLite is quite the tool ... I really beat the heck out of it once, or tried to. It didn’t flinch. 31 million rows at one point, and it still wasn’t flinching.
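A minimal sketch of that setup with DBI and the DBD::SQLite driver. The table and column names (seqs, id, header, seq) and the sample row are made up for illustration; note the non-unique index and the transaction around the inserts, per the advice above:

```perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one row per sequence, keyed by its short id.
my $dbh = DBI->connect("dbi:SQLite:dbname=seqs.db", "", "",
                       { RaiseError => 1 });
$dbh->do("CREATE TABLE IF NOT EXISTS seqs (id TEXT, header TEXT, seq TEXT)");
$dbh->do("CREATE INDEX IF NOT EXISTS seqs_id_idx ON seqs (id)");  # non-unique

$dbh->begin_work;   # bulk inserts belong inside one transaction
my $sth = $dbh->prepare("INSERT INTO seqs (id, header, seq) VALUES (?, ?, ?)");
$sth->execute("F0Z7V0F01A03EB", "F0Z7V0F01A03EB_210", "ACGT");
$dbh->commit;

# Much of the later processing then becomes a query:
my ($header) = $dbh->selectrow_array(
    "SELECT header FROM seqs WHERE id = ?", undef, "F0Z7V0F01A03EB");
```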
Re: renaming 1000's of FASTA files
by osbosb (Monk) on Jul 11, 2011 at 12:50 UTC
One thing you should always make sure of is that every file handle you open has a corresponding close. It's good practice.
Sorry I couldn't be more helpful.
How is this good practice in Perl?
I'm genuinely asking this. I have found some good practices for myself, for example that every if block should also have an else block, even if that block just dies, because in the long run I will need code in that block anyway. But I don't see what bugs closing a file prevents, or what diagnostics it helps to create. I neglect checking the result of close(), but that's about the only "benefit" I see.
$ cat /proc/sys/fs/file-max
386943
"I neglect checking the result of close(), but that's about the only 'benefit' I see."
close() is where you get the most useful information, error-checking-wise, especially on writing. Some people I've worked with have even gone as far as suggesting not to error-check the opens or writes at all, and to check only close(), for efficiency and completeness.
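For example, with buffered output an error such as a full disk may only surface when the buffer is flushed at close time, so the close() return value is the final catch-all (the file name here is arbitrary):

```perl
use strict;
use warnings;

open my $fh, '>', 'out.txt' or die "open: $!";
print {$fh} "some data\n"   or die "print: $!";
# Buffered write errors are reported here, when the buffer is flushed:
close $fh                   or die "close: $!";
```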
Re: renaming 1000's of FASTA files
by tokpela (Chaplain) on Jul 11, 2011 at 17:29 UTC
I would also add to the above posts that if you start to run into memory issues with the hash you are using, use DBM::Deep.
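A minimal sketch, assuming DBM::Deep is installed (the file name is arbitrary): the hash then lives on disk instead of in RAM, but is read and written like an ordinary hash.

```perl
use strict;
use warnings;
use DBM::Deep;

# Backed by a file on disk, so its size is not limited by RAM.
my $db = DBM::Deep->new("seq_id.db");
$db->{"F0Z7V0F01A03EB"} = "F0Z7V0F01A03EB_210";
print $db->{"F0Z7V0F01A03EB"}, "\n";
```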
Re: renaming 1000's of FASTA files
by Cristoforo (Curate) on Jul 11, 2011 at 21:13 UTC
F0Z7V0F01A03EB_210
F0Z7V0F01A03EB_180
F0Z7V0F01A03EB_136
F0Z7V0F01A03EB_362
After closer examination of the text file with ids, I see that the same id has 4 different endings: _210, _180, _136, _362. Is this so, or is it only how you provided the sample input file? If it is this way, how would you choose the one to use for an id of F0Z7V0F01A03EB?
F0Z7V0F01A03EB_210
F0Z7V0F01A03EB_180
F0Z7V0F01A03EB_136
F0Z7V0F01A03EB_362
and the sequence would be the same for all files, i.e. the sequence from the fasta file F0Z7V0F01A03EB.
You realize you have a problem here? You keep a hash whose key is the ID without the "_xxx" ending and whose value is the ID with the ending. But a hash element can have only ONE value, so you overwrite the older elements with the newer ones, unless you take care to store the IDs in an array and then store the arrayref as the value of the hash element.
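The arrayref approach might look like this (using the ids from the sample above):

```perl
use strict;
use warnings;

my %seq_id;
for my $line ("F0Z7V0F01A03EB_210", "F0Z7V0F01A03EB_180") {
    (my $id = $line) =~ s/_.*//;     # copy, then strip the suffix
    push @{ $seq_id{$id} }, $line;   # keep every ending, not just the last
}
# $seq_id{"F0Z7V0F01A03EB"} now holds both full ids:
# ["F0Z7V0F01A03EB_210", "F0Z7V0F01A03EB_180"]
```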
CountZero
"A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James