Beefy Boxes and Bandwidth Generously Provided by pair Networks
good chemistry is complicated,
and a little bit messy -LW
 
PerlMonks  

replace FASTA sequence headers

by onlyIDleft (Scribe)
on Jun 10, 2012 at 12:07 UTC ( #975419=perlquestion: print w/replies, xml ) Need Help??

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,

I am trying to replace the headers of a multifasta file as input 1 by the set of new headers in input 2.

INPUT 1 could be for example:

>head1 \n seq1 \n >head2 \n seq2 \n

INPUT2 could be, for example

head1 \t headA \n headB \t headB \n

The FINAL output should be of the form

>headA \n seq1 \n >headB \n seq2 \n

This is the code I've have so far, but I have trouble when trying to use key value in hashes to extract out new header! Strange, what am I doing wrong?

note: I am a a very new to Perl programming

Thanks in advance to anyone who can help

use strict; use warnings; my @filenames = @ARGV; my @output; my (@headers_seqs, %head_seqs, %header_pairs, $destination); open(IN1, '<', $filenames[0]) or die "Can't read from multifasta file +with alternating lines of headers and sequences $_: $! \n"; open(IN2, '<', $filenames[1]) or die "Can't read from tab-delimited he +ader replacement file $_: $! \n"; while(<IN1>){ chomp; if ($_=~ m/\>/) #looks for match to the + '>' character { my $header = $_; # print $header, "\n"; $header =~ s/\>//; push @headers_seqs, $header; # print "header:", "\t", $header, "\n"; } elsif ($_!~ m/\>/) #looks for match to +the '>' character { my $seq = $_; # print "seq:", "\t", $seq, "\n"; push @headers_seqs, $seq; } } print $#headers_seqs, "\n"; %head_seqs = @headers_seqs; ###########################***********************-------------------- +--------------------***********************########################## +# my @head_orig_new; while(<IN2>){ chomp; my @line_splits = split('\s+',$_); #print "@line_splits", "\n"; my $orig_header = $line_splits[0]; #print $orig_header, "\n"; my $new_header = $line_splits[1]; #print $new_header, "\n"; push @head_orig_new, $orig_header; push @head_orig_new, $new_header; } #print $#head_orig_new, "\n"; %header_pairs = @head_orig_new; ###########################***********************-------------------- +--------------------***********************########################## +# foreach(keys %head_seqs) { push @output, '>', $header_pairs{$_}, "\n"; push @output, $head_seqs{$_},"\n"; } ###########################***********************-------------------- +--------------------***********************########################## +# $destination = $filenames[0].'_headers-replaced.fasta'; open (OUT, '>', $destination) or die "Can't write to file $destination +: $!\n"; print OUT @output; close IN1; close IN2; close OUT;

Replies are listed 'Best First'.
Re: replace FASTA sequence headers
by Athanasius (Bishop) on Jun 10, 2012 at 17:23 UTC

    Hello onlyIDleft,

    There are a number of problems with your code. For example, lines such as

    %head_seqs = @headers_seqs;

    make no sense. You need to read up on perl data structures to understand arrays and hashes.

    I think the following script does what you need:

    use strict; use warnings; open(my $in1, '<', $ARGV[0]) or die "Can't read from multifasta file '$ARGV[0]': $!"; my @multifasta = <$in1>; close $in1; open(my $in2, '<', $ARGV[1]) or die "Can't read from header replacement file '$ARGV[1]': $!"; my %headers; while (<$in2>) { my ($orig, $new) = split; warn "Duplicate replacement: $orig --> $new\n" if exists $headers{ +$orig}; $headers{$orig} = $new; } close $in2; my $destination = $ARGV[0] . '_headers-replaced.fasta'; open (my $out, '>', $destination) or die "Can't write to file '$destination': $!"; foreach my $line (@multifasta) { if ($line =~ /^\>/) # it is a header { foreach my $key (keys %headers) { if ($line =~ /$key/) { $line =~ s/$key/$headers{$key}/; last; } } } print $out $line; } close $out;

    HTH,

    Athanasius <°(((><contra mundum

      lines such as
      %head_seqs = @headers_seqs;
      make no sense.

      It makes perfect sense.    Copying values between lists, arrays and hashes is a normal idiom in Perl.

      if ($line =~ /$key/) { $line =~ s/$key/$headers{$key}/; last; }

      Your $key values may contain regular expression meta-characters so you should use quotemeta on them.

      Your loop exits on the first match but not necessarily the correct match.    You should anchor the patterns.

        Thanks for your replies jwkrahn.

        Actually, EACH of my values in the hash of new / old headers have # character in them, some of them have other special characters such as _ etc

        I was under the impression that because my keys are all letter/number strings, that there should not be any problem with having special characters in the values, BUT not in the keys themselves

        Am I missing something here about needing "quotemeta" for the hash's values as well?

        I hope I am explaining the header pairs properly

        Here is an example below:

        cluster14621 \t Helitron#element_1

      Thanks a lot Athanasius

      There are so many ideas that are new to me here, from just one script of yours! Thanks :)

Re: replace FASTA sequence headers
by jwkrahn (Monsignor) on Jun 10, 2012 at 19:34 UTC

    It sounds like you want something like this:

    use strict; use warnings; @ARGV == 2 or die "usage: $0 <multifasta file> <header replacement fil +e>\n"; my ( $fasta_file, $header_file ) = @ARGV; my $destination = $fasta_file . '_headers-replaced.fasta'; open IN2, '<', $header_file or die "Can't read from tab-delimited head +er replacement file $header_file: $!\n"; my %head_seqs; while ( <IN2> ) { chomp; my ( $old, $new ) = split /\t/; $head_seqs{ $old } = $new; } close IN2; open IN1, '<', $fasta_file or die "Can't read from multifasta file wit +h alternating lines of headers and sequences $fasta_file: $!\n"; open OUT, '>', $destination or die "Can't write to file $destination: +$!\n"; while ( <IN1> ) { if ( /^>(.+)$/ && exists $head_seqs{ $1 } ) { $_ = ">$head_seqs{ $1 }\n"; } print OUT $_; } close IN1; close OUT;
Re: replace FASTA sequence headers
by Cristoforo (Curate) on Jun 10, 2012 at 20:41 UTC
    If you are going to be working with fasta and genes, you might want to see BioPerl. A lot of code has been developed to handle Biogenetics problems. Here is how you might want to parse and print the fasta file using Bio::SeqIO (and Bio::Seq for methods like display_id).

    #!/usr/bin/perl use strict; use warnings; use Bio::SeqIO; my %lookup; open my $fh, "<", 'input2.txt' or die $!; while (<$fh>) { chomp; my ($id, $replace) = split /\t/; $lookup{$id} = $replace; } close $fh or die $!; my $in = Bio::SeqIO->new( -file => "input1.txt" , -format => 'fasta'); my $out = Bio::SeqIO->new( -file => '>test.dat', -format => 'fasta'); while ( my $seq = $in->next_seq() ) { if (defined(my $replace = $lookup{ $seq->id })) { $seq->display_id($replace); } $out->write_seq($seq); } __END__ *** input 1 >chr1 AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGC CAAACCCCAAAAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAAT TTTATCTTTAGGCGGTATGCACTTTTAACAAAAAANNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCATACCCCGAAC CAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN >chrM GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA AAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATAACAATTGAAT GTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAANAATTTCCACC >GJKKTUG01DYDGC GGGTATTCCTTCTCCACCTTGCAGCTAACATCAGTGTTTCGTCTACTCAAGCACGCCAAC ACGCCCTAGAGCGCCCTGTCCAGGGGATGGCAACCAACTCTGACCCTGCAAGTGCAGCAG ACATGAGGAATACAAACTACAATCTTTTACTTGATGATGCAATGCCGGACAAACTCTAGA >F0Z7V0F01EDB3V AAGGCGAGNGGTATCACGCAGTAAGTTACGGTTTTCGGGTAACGCGTCNGNGGNACTAAC CCACGGNGGGTAACCCGTCNCTACCGGTATAGGACTAAGGTTACCGGAACGTCGTGGGGT ACCCCCCGGACGGGGACCGTCCCCTCATANAGTCAACNGTNTGAGATGGACTAACTCAAA CCTAGTTTCAAGTACTATTTAACTTACTTACGTTACCCGTAATTTCGGCGTTTAGAGGCG *** input 2 chr1 headA chrM headB
    This prints out (to file 'test.dat'):

    >headA AACCCCCCCCTCCCCCCGCTTCTGGCCACAGCACTTAAACACATCTCTGCCAAACCCCAA AAACAAAGAACCCTAACACCAGCCTAACCAGATTTCAAATTTTATCTTTAGGCGGTATGC ACTTTTAACAAAAAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN NNNNNNNNNNNNNNNNNNNNGCCCATCCTACCCAGCACACACACACCGCTGCTAACCCCA TACCCCGAACCAACCAAACCCCAAAGACACCCCCCACAGTTTATGTAGCTTACCTCNNNN >headB GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTT CGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTGGAGCCGGAGCACCCTATGTC GCAGTATCTGTCTTTGATTCCTGCCTCATTCTATTATTTATCGCACCTACGTTCAATATT ACAGGCGAACATACCTACTAAAGTGTGTTAATTAATTAATGCTTGTAGGACATAATAATA ACAATTGAATGTCTGCACAGCCGCTTTCCACACAGACATCATAACAAAANAATTTCCACC >GJKKTUG01DYDGC GGGTATTCCTTCTCCACCTTGCAGCTAACATCAGTGTTTCGTCTACTCAAGCACGCCAAC ACGCCCTAGAGCGCCCTGTCCAGGGGATGGCAACCAACTCTGACCCTGCAAGTGCAGCAG ACATGAGGAATACAAACTACAATCTTTTACTTGATGATGCAATGCCGGACAAACTCTAGA >F0Z7V0F01EDB3V AAGGCGAGNGGTATCACGCAGTAAGTTACGGTTTTCGGGTAACGCGTCNGNGGNACTAAC CCACGGNGGGTAACCCGTCNCTACCGGTATAGGACTAAGGTTACCGGAACGTCGTGGGGT ACCCCCCGGACGGGGACCGTCCCCTCATANAGTCAACNGTNTGAGATGGACTAACTCAAA CCTAGTTTCAAGTACTATTTAACTTACTTACGTTACCCGTAATTTCGGCGTTTAGAGGCG

    Chris

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://975419]
Front-paged by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others meditating upon the Monastery: (7)
As of 2020-07-09 18:30 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?