CaptainF:

I think the first thing I'd do is to break the problem into a smaller set of problems to reduce the number of comparisons to make. If you have 10,000 strings to compare, a brute force approach has you making about 1x10^8 comparisons (roughly N^2). But if you can break your files into M roughly equal groups, you've already reduced the comparisons to roughly M * (N/M)^2. For M=32, it would be about 3.13x10^6 comparisons, nearly two orders of magnitude. Breaking your data set into multiple small groups will give you a big time savings.

(Yes, it's hand-wavy and I made no attempt to be rigorous. Obviously a degenerate case wouldn't do you any good, and some files will appear in multiple groups, of course. But I'm guessing that your data won't fall into the worst case. If you want rigor, please send a check for the appropriate amount, and I'll take care of it--Heh, heh!)

An approach something like the following might be able to break your list of files into plenty of groups:

Input: @FileNames : A list of candidate files Step 1: Build a hash to group potentially similar strings { local $/ = \32; for my $curFile (@FileNames) { open INF, '<', $curFile or die... my $key = <INF>; close INF or die... # list of files matching the first 32 chars push @{$Candidates{$key}{LIST}}, $curFile; # regex for matching similar groups my $tmp = $key; $tmp =~ s/N/./; $Candidates{$key}{REGEX} = qr/$r/; } } Step 2: Merge compatible groups for my $key (keys %Candidates) { $Candidates{$key}{SIMILAR} = grep { /$Candidates{$key}{REGEX}/ && ($_ ne $key) } keys %Candidates; } Step 3: Remove unmatchable items my %UnmatchableKeys; for my $key (keys %Candidates) { next if exists $Candidates{$key}{SIMILAR}; next if scalar @{$Candidates}{$key}{LIST}} > 1; $UnmatchableKeys{$key} = 1; } Step 4: Generate lists of items needing to be compared for my $key (keys %Candidates} { next if exists $UnmatchableKeys{$key}; my @similarFiles = (@{$Candidates}{$key}{LIST}}); for my $others (keys %{$Candidates{$key}{SIMILAR}}) { push @similarFiles, @{$Candidates{$key}{LIST}}; } print "GROUP: ", join(", ", sort @similarFiles), "\n"; }

(Note: Horribly untested, and I probably botched a good bit of the syntax.)

There some improvements you can make, such as removing identical groups, etc. You can also apply this recursively, taking the next N bytes and further partitioning your sets until you run out of text.

Another approach you might take is figuring out a coding scheme where you can compute a code for each string and build a hash of similar codes to find similar items. But I couldn't think of a suitable scheme that would work with wildcarded 'N' characters--hence the brute force approach.

...roboticus

In reply to Re: simple string comparison for efficiency by roboticus
in thread simple string comparison for efficiency by CaptainF

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.