Finding a _Similar_ Substring? (Fuzzy Searching?)

I am searching to see if $string contains a substring, let's say "P100". However, I also want it to match to "P-100" and "P 100", or even "P1 00". The P100 is a variable; right now I have "if ($string =~ /\Q$model\E/)", so whatever I do needs to be programmatic . . . any ideas?

by duff (Vicar) on May 21, 2004 at 02:35 UTC
by BUU (Prior) on May 21, 2004 at 02:52 UTC
    Actually, reading your requirements, it sounds like a better solution might be to define a list of characters that "don't matter" when you're matching (or doing whatever you want to do). An easy way to do this would be something like:
    my @ignore=(' ','-'); #whatever
    for(@ignore){ s/$_//g; }
    #match against $_

      Since in this type of situation I'd normally expect the one pattern to be matched against many strings, I'd usually aim to approach this instead by modifying the regexp:

      my @ignore=(' ','-'); #whatever
      # zero or more ignorable characters are allowed between each pair of characters
      my $ignoreclass = sprintf '[%s]*', join '', map quotemeta, @ignore;
      $re = join $ignoreclass, split //, $re;

      Of course, it is only this simple when the initial pattern is a plain string; introducing such modifications reliably into a full-on regexp is rather more difficult.
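A minimal sketch of the pattern-modifying approach for the plain-string case, assuming a non-empty list of single-character ignorables. The helper name `fuzzy_pattern` is hypothetical and not from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex that matches $literal while allowing
# any run of the ignorable characters between its characters.
# Assumes @ignore is non-empty and holds single characters.
sub fuzzy_pattern {
    my ($literal, @ignore) = @_;
    # Character class of escaped ignorables, repeated zero or more times.
    my $class = sprintf '[%s]*', join '', map { quotemeta($_) } @ignore;
    # Escape each character of the literal, then glue them with the class.
    my $body = join $class, map { quotemeta($_) } split //, $literal;
    return qr/$body/;
}

my $re = fuzzy_pattern('P100', ' ', '-');
for my $candidate ('P100', 'P-100', 'P 100', 'P1 00', 'P200') {
    printf "%-6s %s\n", $candidate, ($candidate =~ $re ? 'matches' : 'no match');
}
```

Because the pattern is built once, it can be reused against many strings without touching them.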


      If your ignore set is too complicated for a character class, you can OR the entries together into a regex. I doubt it would be necessary here; it is more likely to be needed for sets of words.

      my $ignoreStrings = join "|", map quotemeta, @ignore;
      my $deleteThese = qr/$ignoreStrings/;   # qr// does not accept /g
      $string =~ s/$deleteThese//g;           # /g belongs on the substitution

      By the way, you're using $_ to represent the various elements of @ignore, but also to denote the default object of s///. That's why I tend to avoid defaults .... better to be explicit, self-documenting, and avoid irritating errors.
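A sketch of the same stripping idea written with explicit variables, so the loop variable and the substitution target can no longer be confused. The helper name `strip_ignored` and the sample string are illustrative assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $text with every ignorable
# string removed. Works on a copy, so the caller's string is untouched.
sub strip_ignored {
    my ($text, @ignore) = @_;
    for my $junk (@ignore) {
        $text =~ s/\Q$junk\E//g;   # \Q...\E treats $junk as literal text
    }
    return $text;
}

my $model  = 'P100';
my $string = 'The P-1 00 ships next week';
my $found  = strip_ignored($string, ' ', '-') =~ /\Q$model\E/;
print $found ? "found\n" : "not found\n";
```

Note that deleting spaces also merges neighbouring words, so this can match across what used to be word boundaries.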


by BrowserUk (Pope) on May 21, 2004 at 03:21 UTC

    Depending upon how loose you want the criteria to be, you might get away with something like this.

    my $term = 'P100';

    ## my $re = qr[@{[ join '\W*', split '', $term ]}];

    # Improved slightly.
    my $re = qr[@{[ join '\W*', map "\Q$_\E", split '', $term ]}]x;

    for( 'P100', 'P-100', 'P 100', 'P1 00',
        'the P 100 is very similar in style to the P-101 & P102.'.
        'The P-100 is a generation behind the P1000'
    ) {
        print "Matched $1" while m[\b($re)\b]g;
    };;

    Matched P100
    Matched P-100
    Matched P 100
    Matched P1 00
    Matched P 100
    Matched P-100

    You could also add /i if you want case insensitivity.

      What exactly are you doing in the regexes at the top? What's the difference between the first and second one?
by ambrus (Abbot) on May 21, 2004 at 11:47 UTC
    If, as others have suggested, you want most characters to be ignored, you could strip all of those characters (with y///d) from both the haystack and the needle string, and then perform the match. You may also want to use case-insensitive matching.
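A rough sketch of that normalize-both-sides idea, assuming that only ASCII letters and digits are significant and that case should be ignored. The helper name `normalize` and the sample strings are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: lowercase the text, then delete every character
# that is not an ASCII letter or digit (tr with /c complements the set,
# /d deletes the characters that are found).
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/a-z0-9//cd;
    return $text;
}

my $needle   = normalize('P100');
my $haystack = normalize('Our new p-1 00 model');
print index($haystack, $needle) >= 0 ? "found\n" : "not found\n";
```

Since the normalized strings have no separators left, a plain substring test would also find "P100" inside "P1000"; a stricter check would be needed if that matters.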

