http://www.perlmonks.org?node_id=961099

jrblas has asked for the wisdom of the Perl Monks concerning the following question:

I want to find all the matches of a given pattern in a string but doing fuzzy (approximate) pattern matching. I want to allow small variations (1 or 2 positions substituted) in the match. I have tried String::approx module, but I do not know how to manage syntax

... use String::approx 'amatch'; my $pattern = "JEJE"; my $string = "EJKJUJHJDJEJEJEDEJOJOJJJAHJHJSHJEFEJUJEJUJKIJS"; while (?=/$pattern/) { ...

How could I achieve that this 'while' allow 1 substitution from $pattern? That is that "JEFE", "JUJE","JEDE",... would be true matches.

Thanks a lot in advance. JR

Replies are listed 'Best First'.
Re: approximate regular expression
by moritz (Cardinal) on Mar 22, 2012 at 20:59 UTC

    If pattern and number of allowed substitutions are small, you can simply throw a regex at it:

    use strict; use warnings; use 5.010; my $pattern = "JEJE"; my $string = "EJKJUJHJDJEJEJEDEJOJOJJJAHJHJSHJEFEJUJEJUJKIJS"; my @parts = qw/ JE.. J..E ..JE J.J. .E.E .EJ. /; my $regex = join '|', @parts; while ($string =~ /($regex)/g) { say "$1 at position ", pos($string); } __END__ JKJU at position 5 JHJD at position 9 JEJE at position 13 JEDE at position 17 JOJO at position 21 JJJA at position 25 JHJS at position 30 JEFE at position 35 JUJE at position 39 JUJK at position 43

    If you also want overlapping matches, use while ($string =~ /(?=($regex))./g) { ... }

Re: approximate regular expression
by AnomalousMonk (Archbishop) on Mar 22, 2012 at 23:00 UTC

    Some variations on a Text::LevenshteinXS distance theme. The substr version (d) might be best (i.e., fastest), but I haven't Benchmark-ed anything. All these solutions find overlapping matches. All examples are of distance 1 (Update: but other distances could be used). (Sorry for any line-wrap in the output.)

      This is a very good option (I've tried as you specify and it works). However, I need to put in $pattern a regular expression. This regexp would vary both in nature and size. For instance: /^D{3,12}DJH^F{1,4}../ How could I explore for 'approximate matches to this regexp'?

        As always, the Devil is in the details. Unless and until you can make clear (first to yourself, then to others) the meaning of the phrase "an approximate match to a regex", you may make but little progress. To me, for instance, the notion of a "regex match" already incorporates vague notions of fuzzyness and approximation. Just what does it mean to add approximation to approximation, or to measure its degree?

        Unless you write your regular expression engine, and use the plugin system to use it, regular expressions are really the wrong way to attack this.
Re: approximate regular expression
by Marshall (Canon) on Mar 23, 2012 at 02:34 UTC
    I like the idea from moritz.

    One aspect about this that may not be clear is that Perl can generate and use dynamic regex's "on the fly" - meaning that your code can dynamically generate a regex as a string and that regex can be used in the code later on. This "dynamic run-time generation of regex's" is a "magical feature of Perl" - and it works great!

    Update:

    #!/usr/bin/perl -w use strict; my $pattern = "JEJE"; my $string = "EJKJUJHJDJEJEJEDEJOJOJJJAHJHJSHJEFEJUJEJUJKIJS"; # from the $pattern, generate a $regex like this: my $regex = "JEJE|.EJE|J.JE|JE.E|JEJ."; # and use it like this: my (@matches) = $string =~ m/$regex/g; print join ("\n", @matches), "\n"; __END__ prints: JDJE JEJE JEFE JUJE
    I'm not sure what the desired output should be (overlapping or not).
    The basic idea that I am saying is that you can generate a regex "on-the-fly" based upon some input and use it in subsequent code.

    Update:
    I am unsure about the fastest way (execution time-wise) to generate the combinations for the $regex - here is one attempt. This needs to be expanded to account for ".." two "anythings" in the pattern. But I think the basic idea is sound, generate a regex pattern dynamically, compile it, and run it against the input dataset.

    #!/usr/bin/perl -w use strict; my $pattern = "JEJE"; my @patterns = ($pattern); for (my $i=0; $i<length($pattern); $i++) { my $copy = $pattern; substr ($copy,$i,1) = "."; push @patterns, $copy; } print join("|",@patterns), "\n"; #prints: JEJE|.EJE|J.JE|JE.E|JEJ.
    I'm not sure that something like this would be faster, might even be slower.. A 'C' implementation like this would be very fast, but in Perl, I am not sure.
    for (my $i=0; $i<length($pattern); $i++) { my $saved_char = substr ($pattern,$i,1); substr ($pattern,$i,1) = "."; push @patterns, $pattern; substr ($pattern,$i,1) = $saved_char; }
Re: approximate regular expression
by jandrew (Chaplain) on Mar 22, 2012 at 23:37 UTC

    If you would like a non-regex brute force method.

      Yes, split() is certainly "brute force".
      If you have bench-marked this, you know that this is a very "expensive operation".
      @array = split (//,$some_var) is super "expensive" and your code does it many times.

      Going "with the flow" of the language is (usually) going to execute faster and in general "be better", meaning easier to understand.

        Marshall thank you for your feedback

        Honestly I don't have a good handle on what perl "with the flow" really means. I guess I was responding to jrblas's request regarding fuzzy regex's. And by that I mean that fuzzy regex's mostly land in the TODO bucket of the regex wizards from what I have read. I do say that as a regex weakling so there may be something out there that I don't know about. Specifically Marpa seems to promise some alternatives but that is even farther beyond my current grasp.

        With that said I have to confess to laziness in calculating the match score. As a guess the original question appears to fall in the bio-perl realm which upon further study would also benefit from regex Look-Around add-ons. So I offer the following in penance.