ad23 has asked for the wisdom of the Perl Monks concerning the following question:
Hello Monks,
I have created a hash (from a given number of files), which has the following keys : values associated with it:
aw1:10
qs2:20
dd3:30
de4:10
hg5:30
dfd6:20
gf4:20
hgh5:30
hgy3:10
and so on...
There are only 3 values (10,20,30), for this hash. Furthermore, I have a file - "Sample.fa" (with header information) like:
>aw1
ATGCTAGATGCTAGCTAGCTAGCACTGAT
CGATGCTAGCGTAGTCAGCTGATGCTGTA
CGATGCTAGTCGTACG
>qs2
CGAGCTAGTCGTAGTCGTGATGCTGATTA
CGATGCTAGTCGTAGCTAGCTGATGCTGC
CGATGCTAGTCGTAGTC
>dd3
CGTAGTCGTAGTCGTAGTCGATGCTGATG
GCTAGTCGATGCTAGCTAGTCGATGCTGG
CGATGCTGAT
>de4
CGTAGTCGTAGTCGTACGTAGTCGTGAGT
CGATTATTTAGGAGGGACAAGGATAGTA
>hg5
CGTAGTCGTAGTCTAGTCGTGATGCTAGA
>dfd6
CGATGCTACGTACGTAGTCAGTCGTGATG
AATTAGAGCAGATAGAGGGGGAAAGGGTT
AAACCCC
>gf4
CGTAGTCAGTCTAGCTGATGTCGATGCTG
>hgh5
CATGCTAGTCGTAGTCGTAGTCGATGCTT
TTTTAAGGGAACCCCC
>hgy3
CCCCGGGTTTGGGAAAAGGGGGGGGATAG
I want to write the corresponding header and seq from Sample.fa, to 3 files (10.txt, 20.txt, 30.txt -- according to the hash key and value). Following is a snippet from my code :
open (FILE1, ">10.txt");
open (FILE2, ">20.txt");
open (FILE3, ">30.txt");
my @files = glob('Data/*.fa');
my %hash = &readFile();
foreach my $f(@files){
open FILE, $f or die "Cannot open : $!\n";
while (my $line = <FILE>){
chomp $line;
#print "$line\n";
if($line =~ /^>/){
my @head = split (/ /, $line);
my $name = substr($line, 1);
foreach my $key(keys %hash){
if($key eq $name){
if($hash{$key} == 10){
select FILE1;
}
if($hash{$key} == 20){
select FILE2;
}
if($hash{$key} == 30){
select FILE3;
}
}
}
}
}
}
Can someone point me in the right direction?
Thanks!
Re: Compare hash with arrays and print
by almut (Canon) on Jul 12, 2010 at 19:05 UTC
|
By setting $/ appropriately, you could read in entire records, which would make your life somewhat easier.
Also, there is no need to iterate over all keys of the hash to then just pick the one where $key eq $name:
foreach my $key(keys %hash){
if($key eq $name){
if($hash{$key} == 10){
...
A direct lookup $hash{$name} would serve the same purpose.
Here's a simplified demo using __DATA__:
#!/usr/bin/perl
open (FILE1, ">10.txt") or die $!;
open (FILE2, ">20.txt") or die $!;
open (FILE3, ">30.txt") or die $!;
my %hash = (aw1 => 10, qs2 => 20, dd3 => 30,
de4 => 10, hg5 => 30, dfd6 => 20,
gf4 => 20, hgh5 => 30, hgy3 => 10);
{
local $/ = "\n>"; # input record separator
while (<DATA>){
s/^>//mg;
my ($name) = /^([^\n]+)/;
if($hash{$name} == 10){
select FILE1;
}
if($hash{$name} == 20){
select FILE2;
}
if($hash{$name} == 30){
select FILE3;
}
print ">$_";
}
}
__DATA__
>aw1
ATGCTAGATGCTAGCTAGCTAGCACTGAT
CGATGCTAGCGTAGTCAGCTGATGCTGTA
CGATGCTAGTCGTACG
>qs2
CGAGCTAGTCGTAGTCGTGATGCTGATTA
CGATGCTAGTCGTAGCTAGCTGATGCTGC
CGATGCTAGTCGTAGTC
>dd3
CGTAGTCGTAGTCGTAGTCGATGCTGATG
GCTAGTCGATGCTAGCTAGTCGATGCTGG
CGATGCTGAT
>de4
CGTAGTCGTAGTCGTACGTAGTCGTGAGT
CGATTATTTAGGAGGGACAAGGATAGTA
>hg5
CGTAGTCGTAGTCTAGTCGTGATGCTAGA
>dfd6
CGATGCTACGTACGTAGTCAGTCGTGATG
AATTAGAGCAGATAGAGGGGGAAAGGGTT
AAACCCC
>gf4
CGTAGTCAGTCTAGCTGATGTCGATGCTG
>hgh5
CATGCTAGTCGTAGTCGTAGTCGATGCTT
TTTTAAGGGAACCCCC
>hgy3
CCCCGGGTTTGGGAAAAGGGGGGGGATAG
(just re-add the looping over your @files) | [reply] [d/l] [select] |
|
my %hash = &readFile();
my @fasta = glob('Data_Test/*.fa');
foreach my $f (@fasta){
open FILE, $f or die "Cannot open $fastaname for reading: $!\n";
local $/ = "\n>";
while (<FILE>)
{
s/^>//mg;
my ($name) = /^(\w+)/;
if($hash{$name} == 10){
select FILE1;
}
if($hash{$name} == 20){
select FILE2;
}
if($hash{$name} == 30){
select FILE3;
}
print ">$_";
}
}
close(FILE1);
close(FILE2);
close(FILE3);
| [reply] [d/l] |
|
Add some debugging prints (print STDERR ... preferably, as you're manipulating the default output file handle) to figure out what's different from the sample I've given (which does work fine for me).
For example, print the records ($_), $name, the corresponding hash values $hash{$name}, etc. Are they what you'd expect?
| [reply] [d/l] [select] |
|
|
|
|
update: oops I do see that you do have a print statement. and looks like it should work.
This "select FILE3;" statement by itself does nothing useful. I actually wouldn't use select at all in this situation. The Perl print statement is a "smart" critter. If the first arg of print is the file handle of some open file, Perl will print to that file handle. print FILE3 "abc"; will print "abc" to FILE3. The Perl default is essentially "print stdout "abc";". Select makes the default print go to wherever you want (instead of stdout), but here it appears easier to just put the file handle in the print statement.
print FILE3 $_; or similar will work fine. A plain "print;" sends $_ to the default file handle. With a file handle specified, I think you have to explicitly say $_ for the same effect.
| [reply] [d/l] |
Re: Compare hash with arrays and print
by biohisham (Priest) on Jul 12, 2010 at 19:42 UTC
|
Using Bio::SeqIO from BioPerl provides a powerful way of dealing with I/O operations on biological data files having the FastA format or any other format for that matter. The input file object is created as well as three output file objects, each one of these objects has information about the format to read from or write into and the file name to read from or direct the output to, in case the output file doesn't exist, it is created for you too.
Using an appropriate BioPerl interface will eliminate the need to construct regexes to detect sequence identifiers and strings, it will also allow you to flexibly migrate among different biological data formats on the go. That will add up to saving time focusing on the data manipulation tasks rather than coding techniques implementation...
#!/usr/local/bin/perl
#title "Compare hash with arrays and print"
use strict;
use warnings;
use Bio::SeqIO;
my %hash = (
aw1=>10,
qs2=>20,
dd3=>30,
de4=>10,
hg5=>30,
dfd6=>20,
gf4=>20,
hgh5=>30,
hgy3=>10,
);
my $file = "Sample.fa";
my $file10 = "10.fa";
my $file20 = "20.fa";
my $file30 = "30.fa";
my $seq = Bio::SeqIO->new(-file => "<$file", -format=>'fasta'); # inpu
+t object
#output objects
my $seqOut10 = Bio::SeqIO->new(-file => ">$file10", -format=>'fasta');
my $seqOut20 = Bio::SeqIO->new(-file => ">$file20", -format=>'fasta');
my $seqOut30 = Bio::SeqIO->new(-file => ">$file30", -format=>'fasta');
while(my $seqIn = $seq->next_seq()){
for my $key (keys %hash){
if($seqIn->id eq $key && $hash{$key}==10){
$seqOut10->write_seq($seqIn);
}elsif($seqIn->id eq $key && $hash{$key} == 20
+){
$seqOut20->write_seq($seqIn);
}elsif($seqIn->id eq $key && $hash{$ke
+y} == 30){
$seqOut30->write_seq($seqIn);
}
}
}
Excellence is an Endeavor of Persistence.
A Year-Old Monk :D .
| [reply] [d/l] [select] |
|
I'd just like to second this solution.
If you are working with sequence data a lot then it's definitely worth investing some time learning the ways of BioPerl.
In addition to its ability to handle a huge variety of data formats, it provides modules to interact with and run numerous sequence analysis packages such as BLAST, EMBOSS and RepeatMasker
Learning the potential of BioPerl combined with the Ensembl Perl API has stood me in very good stead over the years.
Best of luck with your project
| [reply] |
Re: Compare hash with arrays and print
by toolic (Bishop) on Jul 12, 2010 at 18:34 UTC
|
Can you clarify what problem you are having?
If the problem is that your 3 output files are empty, then you need to print something to them. The only print I see in your code has been commented out. | [reply] |
|
my $in = $ARGV[0];
my $x = $ARGV[1];
my $y = $ARGV[2];
open IN , "<$in" or die $!;
open X, ">$x" or die $!;
open Y, ">$y" or die $!;
while (<IN>) {
if (/^>/) {
if (/infor_present:/) {
select X;
}
else{
select Y;
}
}
print;
}
In the above code, there is a "print" statement after it comes out of loop. So in order to print the contents in multiple files, how can we use print in this manner?
Also, I tried making a hash for the Sample.fa files with header as key and seq as values. But since I am reading multiple files, that approach is not working (as I run out of memory!). Kindly help.
Thanks!
| [reply] [d/l] |
|
Doesn't select command selects the file and print the contents?
No, select alone will not print to a file. You must also use print, as you have done in your new code example.
| [reply] |
Re: Compare hash with arrays and print
by jwkrahn (Abbot) on Jul 12, 2010 at 19:02 UTC
|
my %hash = (
aw1 => 10,
qs2 => 20,
dd3 => 30,
de4 => 10,
hg5 => 30,
dfd6 => 20,
gf4 => 20,
hgh5 => 30,
hgy3 => 10,
);
{ # exchange number values for filehandles
open my $FILE1, '>', '10.txt' or die "Cannot open '10.txt' $!";
open my $FILE2, '>', '20.txt' or die "Cannot open '20.txt' $!";
open my $FILE3, '>', '30.txt' or die "Cannot open '30.txt' $!";
for my $val ( values %hash ) {
$val = $FILE1 if $val == 10;
$val = $FILE2 if $val == 20;
$val = $FILE3 if $val == 30;
}
}
local @ARGV = glob 'Data/*.fa';
my $name;
while ( my $line = <> ) {
if ( $line =~ /^>(.+)/ ) {
$name = $1;
}
print { $hash{ $name } } $_ if exists $hash{ $name };
}
| [reply] [d/l] |
Re: Compare hash with arrays and print
by Marshall (Canon) on Jul 12, 2010 at 21:36 UTC
|
I reversed the hash to get a unique list of the "numbers". Then opened that many files and put the file handles in another hash. This eliminates some of the "if" logic. Using different record separator does help, but a slight bit of fiddling is required.
update: See how different filehandles are used in the print below. Also if what reverse is unclear add a print Dumper(\%rhash); statement.
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my %hash = (aw1 => 10, qs2 => 20, dd3 => 30,
de4 => 10, hg5 => 30, dfd6 => 20,
gf4 => 20, hgh5 => 30, hgy3 => 10);
my %rhash = reverse %hash;
my %fh;
foreach my $num (keys %rhash)
{
open ($fh{"$num.txt"} , '>', "$num.txt") or die ;
}
$/ = '>';
while (<DATA>)
{
my $tag = (m/^(\w+)/)[0]; #"blank" rec at beginning
next unless defined $tag;
s/>$//;
my $filename = "$hash{$tag}.txt";
my $handle = $fh{$filename};
print $handle ">$_";
}
__DATA__
>aw1
ATGCTAGATGCTAGCTAGCTAGCACTGAT
CGATGCTAGCGTAGTCAGCTGATGCTGTA
CGATGCTAGTCGTACG
>qs2
CGAGCTAGTCGTAGTCGTGATGCTGATTA
CGATGCTAGTCGTAGCTAGCTGATGCTGC
CGATGCTAGTCGTAGTC
>dd3
CGTAGTCGTAGTCGTAGTCGATGCTGATG
GCTAGTCGATGCTAGCTAGTCGATGCTGG
CGATGCTGAT
>de4
CGTAGTCGTAGTCGTACGTAGTCGTGAGT
CGATTATTTAGGAGGGACAAGGATAGTA
>hg5
CGTAGTCGTAGTCTAGTCGTGATGCTAGA
>dfd6
CGATGCTACGTACGTAGTCAGTCGTGATG
AATTAGAGCAGATAGAGGGGGAAAGGGTT
AAACCCC
>gf4
CGTAGTCAGTCTAGCTGATGTCGATGCTG
>hgh5
CATGCTAGTCGTAGTCGTAGTCGATGCTT
TTTTAAGGGAACCCCC
>hgy3
CCCCGGGTTTGGGAAAAGGGGGGGGATAG
| [reply] [d/l] [select] |
|
my @keys = keys %{{ reverse %hash }};
my %fh;
foreach my $num ( @keys )
| [reply] [d/l] [select] |
|
| [reply] |
|
|