in reply to simple string comparison for efficiency

CaptainF:

I think the first thing I'd do is to break the problem into a smaller set of problems to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've already reduced the comparisons to roughly M * (N/M)^2. For M=32, it would be about 3.13x10^6 comparisons, nearly two orders of magnitude. Breaking your data set into multiple small groups will give you a big time savings.

(Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--Heh, heh!)

An approach something like the following might be able to break your list of files into plenty of groups:

```Input:
@FileNames : A list of candidate files

Step 1: Build a hash to group potentially similar strings
{
local \$/ = \32;
for my \$curFile (@FileNames) {
open INF, '<', \$curFile or die...
my \$key = <INF>;
close INF or die...

# list of files matching the first 32 chars
push @{\$Candidates{\$key}{LIST}}, \$curFile;

# regex for matching similar groups
my \$tmp = \$key;
\$tmp =~ s/N/./;
\$Candidates{\$key}{REGEX} = qr/\$r/;
}
}

Step 2: Merge compatible groups
for my \$key (keys %Candidates) {
\$Candidates{\$key}{SIMILAR} =
grep { /\$Candidates{\$key}{REGEX}/ && (\$_ ne \$key) }
keys %Candidates;
}

Step 3: Remove unmatchable items
my %UnmatchableKeys;
for my \$key (keys %Candidates) {
next if exists \$Candidates{\$key}{SIMILAR};
next if scalar @{\$Candidates}{\$key}{LIST}} > 1;
\$UnmatchableKeys{\$key} = 1;
}

Step 4: Generate lists of items needing to be compared
for my \$key (keys %Candidates} {
next if exists \$UnmatchableKeys{\$key};
my @similarFiles = (@{\$Candidates}{\$key}{LIST}});
for my \$others (keys %{\$Candidates{\$key}{SIMILAR}}) {
push @similarFiles, @{\$Candidates{\$key}{LIST}};
}
print "GROUP: ", join(", ", sort @similarFiles), "\n";
}