http://www.perlmonks.org?node_id=1000655

microkorg has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,
A very new Perl programmer having a touch of trouble getting a small program to loop through an array.
The file I have read into the array is a .fasta DNA sequence file and has the format:

>Sequence_Header_A
DNAsequence
>Sequence_Header_B
DNAsequence
>Sequence_Header_C
DNAsequence


etc.

I need a program to cycle through the array from top to bottom and perform various steps.
It needs to:
- Recognise a header line and capture the header name
- insert a newline into the DNAsequence at a pre-determined place (the "e" in this example)
- paste the header line before the newline, forming a new entry into the file
- increase a count on the header each time a new entry is made

Each time it finds a new header the count should reset. Hopefully, giving the output seen below:

>Sequence_Header_A_1
DNAs
>Sequence_Header_A_2
equ
>Sequence_Header_A_3
enc
>Sequence_Header_A_4
e
>Sequence_Header_B_1
DNAs
>Sequence_Header_B_2
equ
>Sequence_Header_B_3
enc
>Sequence_Header_B_4
e

I have managed to get the separate elements working (in a fashion) but I don't seem to be able to get it to work as one element (at the moment the "header capture" section and the "insert newline" section are completely independent. My code is as follows:

#Capture header foreach $line (@array) { if ($line =~ /^>.+$/ ) { $header = $&; } }

...and the section which "cuts" the DNA sequence at the appropriate section is:

$number = 1 ; foreach (@array) { #cut at appropriate sites $_ =~ s/(e)/\n$header_$number\n$1/ ; $number++; #remove double ">>" $_ =~ s/\*/\*\n/g; #remove blank lines $_ =~ s/\n+/\n/g ; }

The last 2 steps are just to tidy up the formatting as the output must be exactly in the format of my 2 examples at the top of this post.

Help please!
Until then, I will get my head down and carry on trying!

Thanks
Microkorg.

Replies are listed 'Best First'.
Re: Perl noob struggling to loop through an array
by aitap (Curate) on Oct 24, 2012 at 15:52 UTC
    Greetings, Microkorg.

    The file I have read into the array is a .fasta DNA sequence file
    Search for "fasta" gives many results, and nearly all threads contain links to already written code for FASTA manipulation: BioPerl http://bioperl.org fasta allBio::AlignIO::fasta

    Your code is OK, but please don't use $& because it can slow down your other regexps. Catch symbols using (regexp) and use them as $number_of_catch_group, like this:

    if ($line =~ /^>(.+)$/ ) { $header = $1; }
    WARNING: Once Perl sees that you need one of $&, "$`", or "$'" anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program. Perl uses the same mechanism to produce $1, $2, etc, so you also pay a price for each pattern that contains capturing parentheses. (To avoid this cost while retaining the grouping behaviour, use the extended regular expression "(?: ... )" instead.) But if you never use $&, "$`" or "$'", then patterns without capturing parentheses will not be penalized. So avoid $&, "$'", and "$`" if you can, but if you can't (and some algorithms really appreciate them), once you've used them once, use them at will, because you've already paid the price. As of 5.005, $& is not so costly as the other two.
    Seen at perlre, right before Quoting metacharacters.

    Sorry if my advice was wrong.
Re: Perl noob struggling to loop through an array
by jwkrahn (Abbot) on Oct 25, 2012 at 03:35 UTC
    $ echo ">Sequence_Header_A DNAsequence >Sequence_Header_B DNAsequence >Sequence_Header_C DNAsequence" | perl -e' while ( <> ) { next unless /^>/; ( my $format = $_ ) =~ s/\n/_%d\n%s\n/; chomp( my @data = split /(?=e)/, <> ); my $count; printf $format, ++$count, $_ for @data; } ' >Sequence_Header_A_1 DNAs >Sequence_Header_A_2 equ >Sequence_Header_A_3 enc >Sequence_Header_A_4 e >Sequence_Header_B_1 DNAs >Sequence_Header_B_2 equ >Sequence_Header_B_3 enc >Sequence_Header_B_4 e >Sequence_Header_C_1 DNAs >Sequence_Header_C_2 equ >Sequence_Header_C_3 enc >Sequence_Header_C_4 e
      Thanks for such quick responses aitap and jwkrahn,
      I am happy with the way my code cuts up the data but I am having a more fundamental problem with getting the loop to perform as I wish.

      Could you suggest a way of getting the 2 sections of my own code into a single loop (I realise that jwkrahn suggested a slightly different method, and this is very much appreciated) but I would quite like to get my own technique running smoothly as well.

      Thanks again for all your help.

      Hi jwkrahn, would you mind putting some comments into your code explaining each section? I am struggling to follow it through. Thanks, microkorg