
Finding a _Similar_ Substring? (Fuzzy Searching?)

by rjahrman (Scribe)
on May 21, 2004 at 02:27 UTC (#355155, perlquestion)
rjahrman has asked for the wisdom of the Perl Monks concerning the following question:

I want to check whether $string contains a substring, say "P100". However, I also want it to match "P-100" and "P 100", or even "P1 00". The model ("P100" here) is held in a variable; right now I have "if ($string =~ /\Q$model\E/)", so whatever I do needs to be programmatic. Any ideas?

Thanks for the help.

Re: Finding a _Similar_ Substring? (Fuzzy Searching?)
by duff (Vicar) on May 21, 2004 at 02:35 UTC
Re: Finding a _Similar_ Substring? (Fuzzy Searching?)
by BUU (Prior) on May 21, 2004 at 02:52 UTC
    Actually, reading your requirements, it sounds like a better solution might be to define a list of characters that "don't matter" when you're matching (or doing whatever you want to do). An easy way to do this would be something like:
    my @ignore=(' ','-'); #whatever
    for(@ignore){
        s/$_//g;
    }
    #match against $_

      Since in this type of situation I'd normally expect the one pattern to be matched against many strings, I'd usually aim to approach this instead by modifying the regexp:

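      my @ignore=(' ','-'); #whatever
      # zero or more ignorable characters may sit between any two characters
      my $ignoreclass = sprintf '[%s]*', join '', map quotemeta, @ignore;
      # $re starts out as the plain string to find; quote each character
      $re = join $ignoreclass, map quotemeta, split //, $re;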

      Of course this is only so simple if the initial pattern is a simple string: a full-on regexp is rather more difficult to introduce such modifications to reliably.
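
As a rough, self-contained sketch of the pattern-modifying approach for a plain-string model, the snippet below builds a regex that allows any run of ignorable characters between the model's characters. The helper name `build_fuzzy_re` and the sample strings are made up for illustration and are not from the thread; it assumes the model is a literal string, not a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a literal model string into a regex that
# tolerates any run of "ignorable" characters between its characters.
sub build_fuzzy_re {
    my ($model, @ignore) = @_;

    # Character class of ignorable characters; '*' makes it optional.
    my $ignore_class = sprintf '[%s]*', join '', map { quotemeta($_) } @ignore;

    # Quote each character of the model, then interleave the class.
    my $pattern = join $ignore_class, map { quotemeta($_) } split //, $model;

    return qr/$pattern/;
}

my $re = build_fuzzy_re('P100', ' ', '-');

my @candidates = ('P100', 'P-100', 'P 100', 'P1 00', 'P200');
for my $candidate (@candidates) {
    printf "%-6s %s\n", $candidate,
        $candidate =~ $re ? 'matches' : 'does not match';
}
```

Because the compiled pattern is built once, it can be reused against many strings without touching them, which is the situation this reply has in mind.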


      If your ignore set is too complicated for a character class, you can OR the entries together into a regex. I doubt it would be necessary here; it is more likely to be needed for sets of words.

      my $ignoreStrings = join "|", map quotemeta, @ignore;
      my $deleteThese = qr/$ignoreStrings/;   # qr// does not take /g
      $string =~ s/$deleteThese//g;           # /g belongs on the substitution

      By the way, you're using $_ to represent the various elements of @ignore, but also to denote the default object of s///. That's why I tend to avoid defaults .... better to be explicit, self-documenting, and avoid irritating errors.
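
A minimal sketch of the same idea with every variable named explicitly, so nothing depends on $_. The helper name `strip_ignored` and the sample ignore list are illustrative assumptions, and sorting the alternatives longest-first is a design choice of this sketch rather than something the thread prescribes.

```perl
use strict;
use warnings;

# Hypothetical helper: delete every occurrence of the given strings from
# a copy of $text, using named variables throughout instead of $_.
sub strip_ignored {
    my ($text, @ignore) = @_;

    # Longest strings first, so a longer alternative is tried before a
    # shorter one that happens to be its prefix.
    my $alternation = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) } @ignore;

    my $delete_these = qr/$alternation/;
    $text =~ s/$delete_these//g;

    return $text;
}

my $string = 'The P - 100 (new) ships today';
print strip_ignored($string, ' ', '-', '(new)'), "\n";
```

The function works on its own copy of the text, so the caller's string is left unchanged and can still be printed or logged in its original form.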


Re: Finding a _Similar_ Substring? (Fuzzy Searching?)
by BrowserUk (Pope) on May 21, 2004 at 03:21 UTC

    Depending upon how loose you want the criteria to be, you might get away with something like this.

    my $term = 'P100';

    ## my $re = qr[@{[ join '\W*', split '', $term ]}];
    # Improved slightly.
    my $re = qr[@{[ join '\W*', map "\Q$_\E", split '', $term ]}]x;

    for( 'P100', 'P-100', 'P 100', 'P1 00',
         'the P 100 is very similar in style to the P-101 & P102.'.
         'The P-100 is a generation behind the P1000'
    ) {
        print "Matched $1\n" while m[\b($re)\b]g;
    }

    Matched P100
    Matched P-100
    Matched P 100
    Matched P1 00
    Matched P 100
    Matched P-100

    You could also add /i if you want case insensitivity.
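
For illustration, here is one way the /i suggestion might look, using the same construction (each character quoted, '\W*' between them). The sample strings are assumptions added for the sketch.

```perl
use strict;
use warnings;

my $term = 'P100';

# Each character of the term quoted, with '\W*' allowed between them.
my $pattern = join '\W*', map { quotemeta($_) } split //, $term;

# /i makes the match ignore case, so "p-100" is accepted as well.
my $re = qr/\b($pattern)\b/i;

my @texts = ('p100', 'P-100', 'the p 100 and the P1000');
for my $text (@texts) {
    print "Matched $1\n" while $text =~ /$re/g;
}
```

The word boundaries still apply, so the trailing "P1000" is not reported as a match.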

      What exactly are you doing in the regexes at the top? What's the difference between the first and second one?
Re: Finding a _Similar_ Substring? (Fuzzy Searching?)
by ambrus (Abbot) on May 21, 2004 at 11:47 UTC
    If, as others have suggested, you want most characters to be ignored, you could strip all those characters (with y///d) from both the haystack and the needle string, and then perform the match. You may also want to use case-insensitive matching.
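
A small sketch of this strip-both-sides idea, assuming that only letters and digits are significant. The helper names `normalize` and `contains_model` are hypothetical, and lower-casing stands in for case-insensitive matching.

```perl
use strict;
use warnings;

# Hypothetical helper: lower-case the text and delete everything that is
# not a letter or digit, so "P-100", "p 100" and "P1 00" all become "p100".
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ tr/a-z0-9//cd;   # c = complement the set, d = delete
    return $text;
}

# True if the normalized needle occurs in the normalized haystack.
sub contains_model {
    my ($haystack, $needle) = @_;
    return index(normalize($haystack), normalize($needle)) >= 0;
}

my @candidates = ('The P-100 is new', 'see p1 00', 'P200 only');
for my $candidate (@candidates) {
    printf "%-18s %s\n", $candidate,
        contains_model($candidate, 'P100') ? 'yes' : 'no';
}
```

Note the trade-off: once spaces and punctuation are gone there are no word boundaries left, so this version also reports "P1000" as containing "P100".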

Node Type: perlquestion [id://355155]
Approved by Old_Gray_Bear
Front-paged by broquaint