PerlMonks
Re^3: Comparing array of aligned sequences
by johngg (Canon) on Jul 11, 2014 at 21:19 UTC ( [id://1093314]=note )
Given the number of sequence lines in your files, processing line by line would be the better option. The approach I'd now adopt is to read the first sequence into a string, compare it with each subsequent sequence in turn, and modify the string as we go.

The flagDiffs() subroutine takes copies of the base sequence string and the next sequence read from the file as arguments and XORs them. The resultant string will have \0 (NULL) characters wherever the characters in the two sequences matched. It then uses a regular expression and pos to find the non-NULL characters, i.e. the non-matches. Finally, it substitutes an 'X' into its copy of the base sequence at each position where there was a non-match and returns it; the returned string is assigned back to the base sequence string.

Here is a command-line example of XOR'ing two strings to demonstrate the process.
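Something along these lines shows the idea. It is a minimal sketch written as a short script rather than a one-liner, and the two sequences are made up for illustration, not taken from the OP's data.

```perl
use strict;
use warnings;

# Two hypothetical aligned sequences of equal length.
my $base = 'ACGTACGT';
my $next = 'ACGAACCT';

# String XOR: identical characters give "\0", differing ones give a non-NULL byte.
my $xor = $base ^ $next;

# pos() reports the offset just past each match, so subtract 1
# to get the zero-based offset of the non-NULL byte itself.
my @diffs;
push @diffs, pos($xor) - 1 while $xor =~ m{[^\0]}g;

print "differences at offsets: @diffs\n";   # differences at offsets: 3 6
```

Note that this relies on both operands being plain strings; under the newer 'bitwise' feature the string form of the operator is spelled ^. instead.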
Once all lines have been processed, the base sequence string can be split on runs of one or more 'X' characters to find the consensus strings.
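A sketch of how the steps described above might fit together is given here. The flagDiffs() body is a reconstruction from the description, and the @lines array of made-up sequences is an assumption standing in for lines read from a file.

```perl
use strict;
use warnings;

# Return a copy of $base with 'X' wherever it differs from $seq.
sub flagDiffs {
    my ($base, $seq) = @_;      # copies, so the caller's strings are untouched
    my $xor = $base ^ $seq;     # "\0" wherever the two strings agree
    while ($xor =~ m{[^\0]}g) {
        # pos() is one past the mismatch, hence the - 1
        substr($base, pos($xor) - 1, 1, 'X');
    }
    return $base;
}

# Hypothetical aligned sequences standing in for the lines of a file.
my @lines = ("ACGTACGTAC\n", "ACGTACCTAC\n", "ACTTACGTAC\n");

chomp(my $baseSeq = shift @lines);
for my $line (@lines) {
    chomp $line;
    $baseSeq = flagDiffs($baseSeq, $line);
}

# Runs of 'X' separate the conserved stretches; grep drops the
# empty leading field produced when the string starts with 'X'.
my @consensus = grep { length } split m{X+}, $baseSeq;
print "$_\n" for @consensus;
```

One thing to bear in mind with this sketch is that an 'X' already placed in the base string will differ from every later sequence and so is simply flagged again, which is harmless unless 'X' is itself a legitimate character in the data.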
The output.
Without a 25,000 sequence file to test on I don't know whether this approach will perform well, but it seems to give the expected results with the sample in your OP.

I hope this is helpful.

Update: Added the command-line XOR example.

Update 2: There's no need to keep updating the base sequence string as we go; just keep a record, as keys of a hash, of the positions where differences are found and modify the string after all lines have been processed. There's also no need to chomp each line (as long as all the sequences are the same length); just chomp the base string at the end. If the sequences differ in length then you are into pre-processing to find the shortest or longest sequence, then either truncating to the shortest or padding to the longest. New code, the output is the same.
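A sketch of how that revision might look follows. The %diffPosns hash name and the made-up @lines sequences are assumptions for illustration, and the sequences are assumed to be of equal length.

```perl
use strict;
use warnings;

# Hypothetical aligned sequences, newlines left in place as when read from a file.
my @lines = ("ACGTACGTAC\n", "ACGTACCTAC\n", "ACTTACGTAC\n");

my $baseSeq = shift @lines;
my %diffPosns;                  # keys are offsets where any sequence differs

for my $line (@lines) {
    # Equal-length lines mean the trailing newlines XOR to "\0" and are skipped.
    my $xor = $baseSeq ^ $line;
    $diffPosns{ pos($xor) - 1 } = 1 while $xor =~ m{[^\0]}g;
}

# Apply every recorded difference in one go, then tidy the line ending.
substr($baseSeq, $_, 1, 'X') for keys %diffPosns;
chomp $baseSeq;

my @consensus = grep { length } split m{X+}, $baseSeq;
print "$_\n" for @consensus;
```

Because the base string is never altered during the loop, each line is compared against the original first sequence, and a position that differs in several lines costs only one hash key.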
Cheers, JohnGG