Beefy Boxes and Bandwidth Generously Provided by pair Networks
Welcome to the Monastery
 
PerlMonks  

Re: Splitting strings into words when there are no separators

by japhy (Canon)
on Sep 14, 2005 at 16:16 UTC ( #491916=note: print w/ replies, xml ) Need Help??


in reply to Splitting strings into words when there are no separators

As far as fuzzy matching goes, I posted code once that showed how to do fuzzy matching with Perl regexes. Here's an updated version:

sub mk_fuzzy { our ($m, $i, $d) = splice @_, 0, 3; use re 'eval'; qr{ (?{ [ $i, $d, $m ] }) ^ @{[ map $_[$_] =~ /!$/ ? substr($_[$_],0,-1) : qq{ (?: $_[$_] (?: | (?(?{ \$^R->[0] }) @{[ $_ < $#_ and "(?! $_[$_+1] + )" ]} (?s: . ) (?{ [ \$^R->[0] - 1, \$^R->[1], \$^R->[2] ] }) | (?!) + ) ) | (?(?{ \$^R->[1] }) (?{ [ \$^R->[0], \$^R->[1] - 1, \$^R->[2] + ] }) | (?!) ) | (?(?{ \$^R->[2] }) (?! $_[$_] ) (?s: . ) (?{ [ \$^R->[0], \$ +^R->[1], \$^R->[2] - 1 ] }) | (?!) ) ) }, 0 .. $#_ ]} $ (?{ [ [$m-$^R->[2], $m], [$i-$^R->[0], $i], [$d-$^R->[1], $d] ] }) }x; }
The use of the function is as follows:
my $test = mk_fuzzy( 1, # max number of modifications to allow 1, # max number of insertions to allow 1, # max number of deletions to allow qw( p e r l ) ); for my $word (qw( pearl earl pearly peely )) { if ($word =~ $test) { print "$word is close enough to 'perl'\n"; } else { print "$word isn't enough like 'perl'\n"; } }
This reports that pearl, earl, and peely are close to "perl".

The reason I send the letters of the word individually is because the function allows you to follow a letter with a ! which means it MUST appear in the word. (And the 'letters' don't have to be just letters, they could be multi-character strings.) Example:

my $test = mk_fuzzy( 1, # max number of modifications to allow 1, # max number of insertions to allow 1, # max number of deletions to allow qw( p e r! l ) # must have the 'r' in this relative location ); for my $word (qw( pearl earl pearly peely )) { if ($word =~ $test) { print "$word is close enough to 'perl'\n"; } else { print "$word isn't enough like 'perl'\n"; } }
This one only reports pearl and earl, since peely didn't keep the 'r'.

After a successful match, by the way, you can inspect $^R to see how many modifications, insertions, and deletions were necessary:

if ($word =~ $test) { my ($m_used, $m_allowed) = @{ $^R->[0] }; my ($i_used, $i_allowed) = @{ $^R->[1] }; my ($d_used, $d_allowed) = @{ $^R->[2] }; print "Using $m_used mods, $i_used inserts, and $d_used dels, $word +matched\n"; }

Jeff japhy Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and perl hacker
How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart


Comment on Re: Splitting strings into words when there are no separators
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://491916]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (7)
As of 2015-07-02 04:36 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (27 votes), past polls