in reply to simple string comparison for efficiency


I think the first thing I'd do is break the problem into a set of smaller problems to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've reduced the work to roughly M * (N/M)^2 = N^2/M comparisons. For M=32, that's about 3.13x10^6 comparisons, a 32-fold savings. Breaking your data set into many small groups will give you a big time savings.
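To make the arithmetic concrete, here's a quick Python sanity check of those counts (illustrative only; like the estimate above, it ignores the factor of 2 from counting each pair once):

```python
def brute_force_comparisons(n):
    # Roughly N^2 pairwise comparisons, as in the estimate above.
    return n * n

def grouped_comparisons(n, m):
    # M groups of N/M strings each, compared only within their own group:
    # M * (N/M)^2 = N^2 / M.
    return m * (n / m) ** 2

print(brute_force_comparisons(10_000))   # 100000000, i.e. 1x10^8
print(grouped_comparisons(10_000, 32))   # 3125000.0, i.e. about 3.13x10^6
```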

(Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--Heh, heh!)

An approach something like the following might break your list of files into a good number of groups:

Input: @FileNames : A list of candidate files

    my %Candidates;

    # Step 1: Build a hash to group potentially similar strings
    {
        local $/ = \32;    # read fixed 32-byte records
        for my $curFile (@FileNames) {
            open my $inf, '<', $curFile or die "Can't open $curFile: $!";
            my $key = <$inf>;
            close $inf or die "Can't close $curFile: $!";

            # list of files matching the first 32 chars
            push @{ $Candidates{$key}{LIST} }, $curFile;

            # regex for matching similar groups ('N' is a wildcard)
            (my $tmp = $key) =~ s/N/./g;
            $Candidates{$key}{REGEX} = qr/^$tmp$/;
        }
    }

    # Step 2: Merge compatible groups
    for my $key (keys %Candidates) {
        $Candidates{$key}{SIMILAR} = [
            grep { /$Candidates{$key}{REGEX}/ && $_ ne $key } keys %Candidates
        ];
    }

    # Step 3: Remove unmatchable items
    my %UnmatchableKeys;
    for my $key (keys %Candidates) {
        next if @{ $Candidates{$key}{SIMILAR} };
        next if @{ $Candidates{$key}{LIST} } > 1;
        $UnmatchableKeys{$key} = 1;
    }

    # Step 4: Generate lists of items needing to be compared
    for my $key (keys %Candidates) {
        next if exists $UnmatchableKeys{$key};
        my @similarFiles = @{ $Candidates{$key}{LIST} };
        for my $other (@{ $Candidates{$key}{SIMILAR} }) {
            push @similarFiles, @{ $Candidates{$other}{LIST} };
        }
        print "GROUP: ", join(", ", sort @similarFiles), "\n";
    }

(Note: Horribly untested, and I probably botched a good bit of the syntax.)
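Since that sketch is untested, here's the same partitioning idea in Python, which is easier to sanity-check. The `group_similar` helper is hypothetical and works on in-memory strings rather than reading 32-byte file prefixes, but it follows the same four steps. Note that the matching is asymmetric, just as in the sketch above: a key containing 'N' can match a literal key, but not vice versa.

```python
import re
from collections import defaultdict

def group_similar(strings, prefix_len=32):
    """Group strings whose first prefix_len chars are compatible,
    treating 'N' as a one-character wildcard."""
    # Step 1: bucket by prefix, and build a wildcard regex per bucket
    buckets = defaultdict(list)
    for s in strings:
        buckets[s[:prefix_len]].append(s)
    regexes = {key: re.compile(re.escape(key).replace('N', '.'))
               for key in buckets}

    # Step 2: for each bucket, find other buckets its regex matches
    similar = {key: [other for other in buckets
                     if other != key and regexes[key].fullmatch(other)]
               for key in buckets}

    # Step 3: drop buckets with a single member and no compatible bucket
    matchable = {key for key in buckets
                 if similar[key] or len(buckets[key]) > 1}

    # Step 4: emit one candidate list per surviving bucket
    groups = []
    for key in sorted(matchable):
        members = list(buckets[key])
        for other in similar[key]:
            members.extend(buckets[other])
        groups.append(sorted(members))
    return groups

# Only the 'ACNT' bucket survives; its wildcard pulls in the 'ACGT' bucket.
print(group_similar(['ACGTAAAA', 'ACNTAAAA', 'TTTTAAAA', 'GGGG'],
                    prefix_len=4))   # -> [['ACGTAAAA', 'ACNTAAAA']]
```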

There are some improvements you can make, such as removing identical groups, etc. You can also apply this recursively, taking the next N bytes and further partitioning your sets until you run out of text.
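The recursive refinement might look something like this (a hypothetical `refine` helper, simplified to bucket on exact chunk equality and ignore the 'N' wildcard, so the real thing would need the regex matching from the earlier steps):

```python
from collections import defaultdict

def refine(strings, chunk=4, start=0):
    """Recursively split a candidate group on successive chunk-sized
    slices; stop when a group is a singleton or the text runs out."""
    if len(strings) <= 1 or all(start >= len(s) for s in strings):
        return [strings]
    # Bucket this group on the next chunk of characters...
    buckets = defaultdict(list)
    for s in strings:
        buckets[s[start:start + chunk]].append(s)
    # ...and refine each bucket on the chunk after that.
    groups = []
    for members in buckets.values():
        groups.extend(refine(members, chunk, start + chunk))
    return groups

# The first chunk keeps all three together; the second splits off 'AAAAYY'.
print(refine(['AAAAXX', 'AAAAXX', 'AAAAYY']))
# -> [['AAAAXX', 'AAAAXX'], ['AAAAYY']]
```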

Another approach you might take is figuring out a coding scheme where you can compute a code for each string and build a hash of similar codes to find similar items. But I couldn't think of a suitable scheme that would work with wildcarded 'N' characters--hence the brute force approach.