http://www.perlmonks.org?node_id=937241

shamshersingh has asked for the wisdom of the Perl Monks concerning the following question:

I have a large set (100000+) of short DNA reads, each 20 characters long. I need to compare all reads against each other and pull out those that differ at exactly one position. Here's the script I came up with.
$| = 1;    # unbuffered output so the progress line updates immediately
my $compare_count = 0;

# Compare every read against every later read (all unordered pairs)
for (my $i = 0; $i < @kmers; $i++) {
    for (my $j = $i + 1; $j < @kmers; $j++) {
        print "\rComparing sequence $i to $j";
        my @result = PCCompare::dissimilarity($kmers[$i], $kmers[$j], 1);
        if ($result[0] == 1) {
            print "\rMatch found: $kmers[$i], $kmers[$j]\n";
            push @variant_kmers, ($kmers[$i], $kmers[$j]);
        }
        $compare_count++;
    }
}
print "\rFinished: $compare_count comparisons made.\n";
The problem is that this loop runs very slowly: it takes on the order of days to process 100000 sequences. Is there a way to make the process faster?
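
For reference, 100000 reads give roughly 5 billion pairs, and the loop above prints an unbuffered progress line for each one, which probably adds a good share of the run time by itself. One possible way around the all-pairs comparison, sketched here under assumptions rather than tested on the real data: replace each position of a read in turn with a wildcard and group reads by the resulting masked string. Two distinct reads that land in the same group differ only at the masked position. The sub names `hamming` and `pairs_differing_by_one` are made up for this sketch, it assumes `*` never occurs in a read, and it does not use `PCCompare` at all.

```perl
use strict;
use warnings;

# Hamming distance of two equal-length strings using bitwise string XOR:
# matching characters XOR to "\0", so counting non-NUL bytes counts mismatches.
# (Hypothetical helper, a possible drop-in for a pairwise check.)
sub hamming {
    my ($s, $t) = @_;
    return ( ($s ^ $t) =~ tr/\0//c );
}

# Return every unordered pair of distinct reads that differ at exactly one
# position, without comparing all pairs. Assumes '*' does not appear in reads.
sub pairs_differing_by_one {
    my @reads = @_;

    # Drop exact duplicates so identical reads are not reported as a pair.
    my %seen;
    my @unique = grep { !$seen{$_}++ } @reads;

    # Group reads by each "one position masked" version of themselves.
    my %bucket;
    for my $read (@unique) {
        for my $pos (0 .. length($read) - 1) {
            my $key = $read;
            substr($key, $pos, 1, '*');    # mask one position
            push @{ $bucket{$key} }, $read;
        }
    }

    # Reads sharing a bucket agree everywhere except the masked position.
    my @pairs;
    for my $group (values %bucket) {
        next if @$group < 2;
        for my $i (0 .. $#$group - 1) {
            for my $j ($i + 1 .. $#$group) {
                push @pairs, [ $group->[$i], $group->[$j] ];
            }
        }
    }
    return @pairs;
}

# Small made-up example; the order of the printed pairs is not fixed.
my @demo_reads = qw(ACGT ACGA TCGT GGGG);
for my $pair (pairs_differing_by_one(@demo_reads)) {
    print "Match found: $pair->[0], $pair->[1]\n";
}
```

The work grows with (number of reads) x (read length) instead of with the square of the number of reads, at the cost of holding one hash entry per read per position in memory (about 2 million keys for 100000 reads of length 20). Each matching pair shows up in exactly one group, the one for the position where the two reads differ, so no pair is reported twice.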