http://www.perlmonks.org?node_id=1131246

onlyIDleft has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks

I seek your help in verifying whether the following code intends what I want it to do, please.

The input file has multiple alternating lines of IDs formatted as >ID1 and DNA sequences as strings of A/T/G/C. I want to deal with each pair of ID followed by its sequence, at a time. For a given sequence, I want to randomly shuffle the sequence in a window of length = 10^6. The move on to the next window until that sequence has been operated upon. Store this shuffled sequence under the same ID. Then move into the next pair of ID and its corresponding sequence. Process until all pairs of IDs and sequences are processed.

use strict; use warnings; use List::Util 'shuffle'; # Idea from http://www.perlmonks.org/?node_i +d=199901 my $input = shift @ARGV; open(IN, '<', $input) or die "Can't read multifasta input genomic DNA +file $input : $!\n"; my $window = 1000000; # hard coded for shuffle window to be 1MB i.e 10 +^6 my (@IDsSeqs,$seq_id, $seq, @output); while (<IN>) { chomp; push @IDsSeqs, $_; } my $index = 0; while ($index <= $#IDsSeqs) { $seq_id = $IDsSeqs[$index]; $seq = $IDsSeqs[$index+1]; my $final_seq; for (my $i = 1; $i <= length $seq; $i=$i+$window ) { my $s = substr ($seq, $i - 1, $window); my @temp_seq_array = split //, $s; my $rep = 1; while($rep <=10) { @temp_seq_array = shuffle @temp_seq_array; # using the Lis +t::Util module AND Shuffles EACH windown 10 times!!! $rep++; } my $rand_shuffled_seq = join ('', @temp_seq_array,); $final_seq = $final_seq.$rand_shuffled_seq; # concatenates + the shuffled DNA seq to the 3' end of the previous 1MB fragment } push @output, $seq_id, "\n",$final_seq,"\n"; $index = $index + 2; # process every alternate line with ID (and its c +orresponding sequence in next line) } my $destination = $input."_1MBWindow_ListUtilshuffle.fasta"; open(OUT, '>', $destination) or die "Can't write to file $destination: + $!\n"; print OUT @output; close OUT;

The reason I ask for your stamp of approval that my script is OK is to be doubly sure I am not making a mistake here. The downstream results I am getting from these randomly shuffled DNA sequences is SO VERY different from what has been published that if/when I claim that those results are wrong, the burden of proof lays heavy on me, and I want to be extra careful making such claims!

UPDATE 1

Thank you for all your responses. All useful. I see that there is no point in 10X shuffling, and I see the logic for why that is the case. JohnGG thanks for re-working the script. I will use that version now. BrowserUK, yes, I can exchange private messages if you are OK with that

But before future runs for shuffling the DNA sequences randomly, may be some more detail may help you help me better. This is more to do with logic than actual scripting... The "element" I am identifying in the genome has two ends. These two ends are each around 50-100 long. The separation between these ends is between 200 and 20,000. The length of the ends, and the separation between them is NOT known a priori. So, I wonder if there are two size ranges or periodicities whose non randomness needs to be broken down, with two successive shuffles, one in the ~ 100 length window and again in a ~1000 length window? Or some other window lengths? Or should I seed the shuffle start position differently? I am confused about this....

A question more related to the Perl modules. Would folks in the know about random shuffling point me to where I can simply understand any differences between List::Util qw(shuffle) and Array::Shuffle qw(shuffle_array). Or point to me whether they are same in algorithm, just different in any other way...

UPDATE 2

The identification of the two ends is NOT deterministic, but based on finding a pattern, where each character in the pattern could be A or T or G or C or some combination of those 2, 3 or all 4 letters. when you search for a perfect match, and when even 1 character gets randomly shuffled you cannot detect it any more, so that is straight forward, right? But here we are NOT looking for JUST perfect matches, but there exists degeneracy in each matching position, so that when you shuffle out one letter and replace it with another, it might still be a match, albeit a different one that "scores" better or worse - that makes it a bit more complicated IMO, doesn't it? This is the actual scenario. I had forgotten to mention this earlier, apologies! How does this change things when it comes to shuffling and periodicities I was referring to in Update 1? THANK YOU! :)

UPDATE 3

Hi Laurent and BrowserUK et al. - thank you for your replies. I am not trying to match the sequences to the query myself, the software I am using does it and returns the score. And BrowserUK's claim that genomicists have no use of score is not true to say the least. While it may be true that for some tasks, it might become necessary from a practical point of view, to impose a threshold score, it doesn't mean scores do not matter. Quite the contrary.... prime case in point - the BLAST tool. OK, I am going on a tangent, so back to the question at hand:

BrowserUK, I am sure understand your point about randomness being randomness, so my bone of contention or argument is NOT about how many numbers of shuffles, because it is now clear to me that doing it multiple times is not better, but a waste of time, no argument there

I am getting more numbers of matches in the shuffled DNA sequences than in the intact DNA sequences! This is an absurd result, which can only mean that there is so much noise being picked up by the software and reported as a match from intact sequences. When the sequence is shuffled, there is more noise that is being generated due to the shuffle per se, and since the software does not do a good job of discrimination, it reports a higher number of matches! This is the only conclusion I can arrive at based on my discussion with you folks. DNA random shuffling that I am doing is not a problem, the approach the software used in separating signal from noise is not efficient enough, is what I am thinking....

Unless any of you folks have a different opinion about my conclusion regarding the software I am using, and why it is reported more number of matches despite DNA sequence random shuffling, I consider this matter closed. Thanks to all who participated in this discussion, I am grateful for your inputs, suggestions, thoughts and code. Cheers!

(hopefully) FINAL UPDATE

In response to most recent post by BrowserUK, there is nothing in that post that I disagree with. However, the main point I am making regarding the software I am using, might have been lost on readers. So I will reiterate: When DNA sequence is shuffled, there will be some loss of "signal". Signal here meaning millions of years of selection for meaningful sequences. So more signal and less noise in intact genomes. And less signal and more noise in shuffled genomes. I dont think there is going to be any disagreement here. The problem arises - as I see it, because the software I am using does not have sufficient "discriminatory power" to efficiently separate signal from noise. If it could, then it would report more true matches proportional to signal in intact genomes, and fewer true matches proportional to reduced signal in shuffled genomes. But in reality, the behavior is opposite. So this is a limitation of the software, or the training set used on it, or the library of sequences that it is searching for or some combination thereof, and not of the shuffling at all. I think it would be very hard to argue against my conclusions. Whether you are a genomicist or not :). Thank you all for your inputs!

LATEST UPDATE & SEEKING ADVICE REGARDING WHERE TO SEEK HELP WITH JAVA PROGRAM

I requested a friend to check my results on the intact genomes which for me are different from the published results that talk about this bioinformatic software. He gets totally different results from me or the published work. I dont know what machine the authors used, I used a UBUNTU based compute cluster managed via SLURM, my friend used a local UNIX machine with 32GB RAM. This software is Java based. So the executable is a *.jar file. Since there is some issue with even intact genomes returning different numbers for different users / machines, may be FDR > 100% on randomly shuffled genome is may be an artifact of the software?!#$%^& Not sure!!! But I looked up the internet to look for hints. May be these are related to the sort of problem I am having? http://bit.ly/1BIIoiv, http://bit.ly/1LqQ8Zm. Software is at : http://bit.ly/1fvIz6y. Where do you advice me to seek help about whether the software is messed up or not. Asking authors has not elicited a useful reply, and I dont think it will in the future. Is there a Java forum like this is for Perl users? Thanks in advance!

Replies are listed 'Best First'.
Re: Random shuffling
by Laurent_R (Canon) on Jun 20, 2015 at 13:26 UTC
    Prima facie, the code looks right. Not sure you need to shuffle 10 times over, but why not? But you haven't really said what you aim is. The shuffle function of the List::Util module is supposed to return the input values in a random order. If your random order is not the same as the one in published sources, this is good news, isn't it? If you obtained the same result, then this would be a strong indication that the random function is not right.

    Or, said in other words, if your random output is not the same as the random output of someone else, then it tends to confirm that this is really random (or, at least, that there is no proof that it is misbehaving). But the fact that your random output is different from someone else random output does not mean that your result is any better or worse than someone else's result.

    I think you need to explain more what you are trying to do.

    Or did I entirely miss your point? I am not a biologist and my knowledge of genetics essentially dates back from the mid-1970s, so a very long time ago both in terms of what I remember about it and in terms of the progresses made in between.

    Then, there is the question of whether the (pseudo-)random order provided by shuffle is good enough to suit your purposes. This is a very difficult question and, besides the fact that I do not know your purposes, I just don't know how good shuffle is. Donald Knuth's The Art of of Computer Programming, Vol. 2, has almost 200 pages on random numbers generation, I think that this shows that this is a rather difficult problem.

      Thank you, that was an insightful reply

      First time I tried to shuffle, I used a different Perl module, but exact same syntax otherwise. That module is

      use Array::Shuffle qw(shuffle_array shuffle_huge_array); # from https://metacpan.org/source/SALVA/Array-Shuffle-0.03/samples/benchmark.pl

      I got inexplicable results using that module to randomly shuffle the sequences. That one is supposed to use the Fisher Yates Knuth algorithm. I contacted the author of the Perl module, and he suggested I cross verify using another Perl module. So using List::Util now.

        And can you say what is inexplicable in the results you obtained?
Re: Random shuffling
by hexcoder (Deacon) on Jun 20, 2015 at 15:47 UTC
    If your DNA sequences don't contain any separators, the code looks okay to me.

    I would optimize the processing in order to avoid wasting time and memory.

    • You only need to hold two input lines in memory, since there are no dependencies to other lines.
    • One random shuffle should be enough to make it random.
    • I would check whether the input file contains an even number of lines, that is when retrieving a pair of lines, check first, if the DNA sequence line is present.
    • $final_seq should be initialized to the empty string in order to avoid a warning on an 'undefined' value.
    So in summary I would write it like this (untested!):
    use strict; use warnings; use List::Util 'shuffle'; # Idea from http://www.perlmonks.org/?node_i +d=199901 my $input = shift @ARGV; open(IN, '<', $input) or die "Can't read multifasta input genomic DNA +file $input : $!\n"; my $destination = $input."_1MBWindow_ListUtilshuffle.fasta"; open(OUT, '>', $destination) or die "Can't write to file $destination: + $!\n"; my $window = 1000000; # hard coded for shuffle window to be 1MB i.e 10 +^6 my ($seq_id, $seq); # process every alternate line with ID (and its corresponding sequence + in next line) while (defined($seq_id = <IN>) { if (!defined($seq = <IN>) { last; } chomp $seg_id; chomp $seg; my $final_seq = ''; for (my $i = 1; $i <= length $seq; $i += $window ) { my $s = substr ($seq, $i - 1, $window); my @temp_seq_array = split //, $s; @temp_seq_array = shuffle @temp_seq_array; # using the List::U +til module AND Shuffles EACH window!!! my $rand_shuffled_seq = join ('', @temp_seq_array,); $final_seq .= $rand_shuffled_seq; # concatenates the shuffled DNA +seq to the 3' end of the previous 1MB fragment } print OUT $seq_id, "\n",$final_seq,"\n"; } close IN; close OUT;

      Please see my UPDATE 1 to the original post, perhaps your answer may change or need modification in light of the additional info I have provided?

      Thank you hexcoder. I appreciate your inputs to write out a modified version of my script to make the runs quicker and more memory efficient. Cheers! PS. There are no separators in the DNA sequences....

Re: Random shuffling
by u65 (Chaplain) on Jun 20, 2015 at 10:33 UTC

    Can you provide some sample data and expected results?

    Update: First pass through the code looks okay to me (without looking at the shuffle module) and I think I know what the input looks like: an Id line followed by a variable length string of space-separated tokens of DNA groups with possibly very long lines. So are your differing results a format difference (I suspect not) or sequence difference? For random shuffling you can't expect the same results unless you use the same random number sequence as in the comparative study.

    Update: I've tried the code and, assuming data like this,

    >ID1 A/B/C/D E/F/G/Z

    there are format errors in the output. We really need more information about the problem, especially the exact input format. (See toolic's comments below.)

      Thanks for your reply. Just to clarify, there are no separators. The sequences are strings of As Ts Gs and Cs. For example (just cooking up something):

      >ID1

      TAGCGAGCGAGCGAAC

      >ID2

      GAGCTAGAGCATGAAT

      >ID3

      GTACGACTAGCTACGGACTGACATCGGAC

      etc

Re: Random shuffling
by toolic (Bishop) on Jun 20, 2015 at 12:29 UTC
    Since you are rightly concerned that your code is correct, you need to compare your output against some verified model. Does anything in BioPerl help?
Re: Random shuffling
by BrowserUk (Pope) on Jun 20, 2015 at 16:11 UTC
    The downstream results I am getting from these randomly shuffled DNA sequences is SO VERY different from what has been published

    How do they differ?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Hi BrowserUK, I am afraid that the divergence in downstream results from my work Vs prior published results goes into hardcore genomics that I am not willing to discuss the details on a public forum, especially because my work is unpublished, and I have no idea who is reading and responding my posts, and if they might be competitors

      This is just a compulsion of the publish-perish paradigm in science today! Please do not take it personally. :)

      But this issue I am posting about here in this threads, is related to our previous discussion thread on shuffling DNA and using that to estimate False Discovery Rates of a certain type of "feature" in the genome

        I am not willing to discuss the details on a public forum, especially because my work is unpublished

        Understood.

        I probably wouldn't understand the hardcore genomics anyway :)

        But ... if you could isolate the differences in non-genomic terms. Eg. The results are more random; or less random; or way too random?

        Or if you are want to discuss it off-line ...

        I am posting about here in this threads, is related to our previous discussion thread on shuffling DNA and using that to estimate False Discovery Rates of a certain type of "feature" in the genome

        I thought the processing was familiar :)

        One question. Why are you shuffling the data 10 times? Is this your innovation; or part of some procedure described somewhere that you are following?

        The reason I ask is: it shouldn't be necessary. Actually, I'll re-state that. It isn't necessary!

        The whole point about the Fischer-Yates shuffle is that every possible reordering of the data is possible; and they all have equal probability.

        Thus; by shuffling multiple times all you are doing is choosing a different mix. Not a better one; nor a more random one; just a different one.

        And that could just as equally be achieved by seeding the PRNG differently: srand( time * rand() ) or similar.

        But as all outcomes are possible from every shuffle -- including the first one -- doing multiple shuffles is just wasting time; nothing more.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re: Random shuffling
by BrowserUk (Pope) on Jun 21, 2015 at 00:58 UTC
    The "element" I am identifying in the genome has two ends. These two ends are each around 50-100 long. The separation between these ends is between 200 and 20,000. The length of the ends, and the separation between them is NOT known a priori. So, I wonder if there are two size ranges or periodicities whose non randomness needs to be broken down, with two successive shuffles, one in the ~ 100 length window and again in a ~1000 length window? Or some other window lengths?

    No.

    Let's start with the minimum length:

    50 + 200 + 50. There are 300! = 3.0605751221644063603537046129727e+614 possible outcomes from the shuffle.

    For one of your 50 char headers or trailers to have survived the shuffle intact -- ie, to either have remained where it was or to have been 'reconstructed' at some other position the calculation goes something like: at any given position, there are 50! = 3.0414093201713378043612608166065e+64 possible ways that the original 50 chars could have been rearranged; which makes the probability that an exact 50 sequence re-appears somewhere after one shuffle, something like 50! / 300! = 9.9373784297785355338568640895281e-551. Which is as close to 'impossible' as you could wish for.

    And that's for the shortest lengths. For your average lengths, you are talking all the computers that have ever existed, do exist and will ever exist, processing flat out until Universal heat death.

    Of course; it could also happen tomorrow -- that's the nature of random -- but duplicating the shuffle doesn't make it any more or less likely to happen.

    understand any differences between List::Util qw(shuffle) and Array::Shuffle qw(shuffle_array)

    They both use the Fischer-Yates method. But ...

    List::Util::shuffle takes the array, flattens it to a list on the stack, shuffles that list on the stack and returns it; where it is then assigned back to an array.

    Array::Shuffle takes a reference to an array and shuffles that in-place.

    The latter is faster because it avoids the array to list & list to array conversions at either end.

    Either -- over a suitable sample size -- will produce similar results. The latter gets you there more quickly.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Thank you for that clarification. I think I now realize I might have missed another point that is also important, so please look out for UPDATE 2 under original post.

      .
        UPDATE 2 The identification of the two ends is NOT deterministic, but based on finding a pattern, where each character in the pattern could be A or T or G or C or some combination of those 2, 3 or all 4 letters. when you search for a perfect match, and when even 1 character gets randomly shuffled you cannot detect it any more, so that is straight forward, right? But here we are NOT looking for JUST perfect matches, but there exists degeneracy in each matching position, so that when you shuffle out one letter and replace it with another, it might still be a match, albeit a different one that "scores" better or worse - that makes it a bit more complicated IMO, doesn't it? This is the actual scenario. I had forgotten to mention this earlier, apologies! How does this change things when it comes to shuffling and periodicities I was referring to in Update 1? THANK YOU! :)

        Does it change the probabilities: yes.

        Does it mean that shuffling twice is better than shuffling once: no.

        Any mix that can result from shuffling twice, could also result from shuffling once; and vice versa. So for every time when you would get a false hit after the first shuffle and not after a second; there is another case where you wouldn't get a false hit after the first shuffle and will after a second.

        I'll try to word that a different way: for any given set of data, there are a huge number of possible reorderings, a (relatively) small number of which would contain a false hit. Whether you shuffle once or 10 times; the odds of you producing one of those arrangements that contains a false hit remains exactly the same.

        The more fuzziness you allow, the higher the proportion of reorderings will contain a false hit. But with 3 or 4 mismatches in a 50 char set the ratio remains vanishingly small.

        But statistically, there is still no benefit at all to multiple shuffles.

        If my explanation isn't clear enough, perhaps running this might convince you:

        #! perl -slw use strict; use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000; use List::Util qw[ shuffle ]; my @data = 'a'..'d'; my %once; ++$once{ join'',shuffle @data } for 1 .. 1e6; pp \%once; my %twice; ++$twice{ join'',shuffle shuffle @data } for 1 .. 1e6; pp \%twice; my %ten; ++$ten{ join'',shuffle shuffle shuffle shuffle shuffle shuffl +e shuffle shuffle shuffle shuffle @data } for 1 .. 1e6; pp \%ten ; __END__ C:\test>shuffleStats.pl { abcd => 41717, abdc => 41646, acbd => 41468, acdb => 41646, adbc => +41673, adcb => 42050, bacd => 41883, badc => 41624, bcad => 41523, bc +da => 41667, bdac => 41775, bdca => 41282, cabd => 41674, cadb => 415 +98, cbad => 41587, cbda => 41892, cdab => 41650, cdba => 41706, dabc +=> 41452, dacb => 41859, dbac => 41638, dbca => 41895, dcab => 41600, + dcba => 41495 } { abcd => 41481, abdc => 41541, acbd => 41422, acdb => 41699, adbc => +41601, adcb => 41502, bacd => 41610, badc => 42086, bcad => 41860, bc +da => 41864, bdac => 41537, bdca => 41770, cabd => 41669, cadb => 420 +34, cbad => 41649, cbda => 41568, cdab => 41802, cdba => 41802, dabc +=> 41745, dacb => 41742, dbac => 41405, dbca => 41441, dcab => 41628, + dcba => 41542 } { abcd => 41723, abdc => 41512, acbd => 41613, acdb => 41633, adbc => +41587, adcb => 41547, bacd => 42015, badc => 41615, bcad => 41706, bc +da => 41752, bdac => 41903, bdca => 41539, cabd => 41306, cadb => 420 +37, cbad => 41673, cbda => 41579, cdab => 41767, cdba => 41582, dabc +=> 42219, dacb => 41463, dbac => 41228, dbca => 41659, dcab => 41892, + dcba => 41450 }

        No matter how many times you shuffle the data, statistically, the results remain identical. Do stddev, chi2x or any other analysis you like, and the results will be the same. Use more data; more repetitions; more shuffles; they will remain the same.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
Re: Random shuffling
by monkey_boy (Priest) on Jun 20, 2015 at 17:32 UTC

    Looks fine to me, but the best way would be to verify it yourself, print out the before & after sequences below each other and change the window size to something readable (e.g. 5).

    That said, I think the problem maybe miss-represented, are you sure you dont want to break up the DNA into window-sized chunks and shuffle the chunks? This would be a more useful exercise!

    Update: ...SO VERY different from what has been published
    It would be nice to see a link to the published work!

    This is not a Signature...

      Yeah, I've done that already, since that is the easiest thing to do. It behaves as expected for mock input files.

      Like I said in my first post, I dont want to assume I am 100% right, when I might be claiming that results in a paper from the lab of an established big wig in my field are incorrect / falsified! :(

      But now, thanks to all your inputs, I know there is no fatal error in logic for this step. So thank you all!

Re: Random shuffling
by johngg (Canon) on Jun 20, 2015 at 23:11 UTC

    As hexcoder says, you only need to read two lines at a time. Rather than using C-style loops to move along the sequence shuffling as you go you could use unpack and map; as BrowserUk has observed, there seems to be no point in shuffling more than once.

    use strict; use warnings; use List::Util qw{ shuffle }; my $inFile = shift; open my $inFH, q{<}, $inFile or die qq{open: < $inFile: $!\n}; my $outFile = $inFile . q{_shuffled.fasta}; open my $outFH, q{>}, $outFile or die qq{open: > $outFile: $!\n}; my $window = 10; while ( not eof $inFH ) { my( $id, $seq ) = map { chomp; $_ } map { scalar <$inFH> } 1 .. 2; my $shuffled = join q{}, map { join q{}, shuffle split m{} } unpack qq{(a$window)*}, $seq; print $outFH qq{$id\n$shuffled\n}; } close $inFH or die $!; close $outFH or die $!;

    I hope this is of interest.

    Update: Corrected cut'n'paste error in script, I was erroneously unpacking $shuffled instead of $seq.

    Cheers,

    JohnGG

      Hi JohnGG, please see my UPDATE 1 to the original post, perhaps your answer may change or need modification in light of the additional info I have provided?

Re: Random shuffling
by Laurent_R (Canon) on Jun 21, 2015 at 10:15 UTC
    If I understand correctly your updates, you are looking for approximate matching (aka fuzzy matching). This is much more complicated than just using regular expressions. I'll just throw here a few pointers. Fuzzy matching often implies computing the Hamming distance or the Levenshtein edit distance (look up these names), which both basically reflect the number of differences (or mismatches) between two strings.

    You might also look for the Baeza-Yates and Gonnet algorithm (sometimes also called the Baeza-Yates k-mismatches algorithm. Another known algorithm of possible interest is the Wu-Mamber k-differences. You might also take a look at the longest common subsequence problem.

    The String::Approx CPAN module might be of some interest.

    Finally I would suggest you take a deep look at the http://www.bioperl.org/wiki/Main_Page/. I am not using it personally (as I said, I am not a biologist) and can't say much about it, but it is quite big and is likely to have implemented some of the things you are trying to do. Also the functionalities in there are likely to be optimized for alphabets of 4 letters such as DNA sequences.

      Maybe it is worth pointing out that algorithms like Levenshtein & Longest Common Subsequence; and modules that implement them (like String::Approx) are entirely useless to genomists.

      They don't care about the "score"; only, is it there or not. And doing all the O(N2) or worse, work that those modules do to calculate their scores in order to derive a simple boolean answer is just way too costly.

      When they usually are doing 10s of thousands of (say) 50/4 matches against thousands of sequences, each containing millions of codons; using those types of fuzzy-match algorithms means runtimes measured in months or years.

      There are much faster ways of doing fuzzy matching when all you need is a yes/no answer.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!
        Maybe it is worth pointing out that algorithms like Levenshtein & Longest Common Subsequence; and modules that implement them (like String::Approx) are entirely useless to genomists. (...)

        There are much faster ways of doing fuzzy matching when all you need is a yes/no answer.

        Thanks for the information. I frankly have a very very limited knowledge of what geneticists are doing with their DNA and other molecular sequences. I suspected that these algorithms might be slow for what geneticists are doing (which is why I recommended to look at BioPerl) but I did not know whether there were faster ways to accomplish their tasks.