Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

non-exact regexp matches

by vinforget (Beadle)
on Jun 23, 2004 at 14:38 UTC ( [id://369048]=perlquestion: print w/replies, xml ) Need Help??

vinforget has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I've been tring to find a way to detect "partial/wobbly" regexp matches i.e. given a string of charaters, return all matches to a regexp, with maximum of 2 failures/mismatches. Example:
$str = "PolarBear"; $str =~ /polarbear/; #returns 2, and/or [0,5] for positions ^ ^ $str =~ /Polerbear/; #returns 2, and/or [3,5] for positions ^ ^ $str =~ /Polarbeer/; #returns 3, and/or [5,7] for positions ^ ^
How would I go about, or is it even possible, to return the number of times the reg-exp fails and/or list the positions in the string. Thanks. P.S. I’ve looked at node id=74651, but is this what I am looking for.... is it stable?

Replies are listed 'Best First'.
Re: non-exact regexp matches
by McMahon (Chaplain) on Jun 23, 2004 at 14:44 UTC
    string-approx might get you there.
    Maybe also levenshtein

    I haven't tried these, but I've been reading about them recently.
      Thanks ! These are pretty cool (I could see come use for them in my work), but I'm looking for something which I could use a regexp:
      $str = "PolarBear"; $str =~ /[Pp]olar[Bb]e[ae]r/;
      if $str is "polarbear" or "polarbeer" or "Polarbear" returns 0
      if $str is "PolarBeat returns 1
      if $str is "PelarBeat returns 2
      etc,etc.
        Ah, I see. (Also didn't realize that String::Approx doesn't use regexes)

        So this is crazy, but it just might work: stick each letter of one string into one array, then each letter of the other string into the other array. Then use List::Compare to figure out how the two strings are different.

        List::Compare is my favorite hammer these days. =)
Re: non-exact regexp matches
by Fletch (Bishop) on Jun 23, 2004 at 15:00 UTC
      Thanks. One of my coworkers has this book, but he's not in today ! I may just end up buying this book... it keep on coming up in regexp discussion! Duh! Vince
Re: non-exact regexp matches
by diotalevi (Canon) on Jun 23, 2004 at 16:04 UTC
Re: non-exact regexp matches
by Roy Johnson (Monsignor) on Jun 23, 2004 at 16:24 UTC
    You can specify alternatives, and use the (?{}) construct to keep track of how many times those alternatives were required. You have to keep track of how when you backtrack, so you don't count those multiple times (if your regex has that possibility).

    The alternatives might allow the regex engine to match less aggressively than you think it ought to, so you could end up with a higher miss rate than you'd like. In the code below, you get an array of the checkpoints that it had to use the alternative. Play with different values of $str.

    use strict; my $str = 'PolaBexar'; # Missing letter, extra letter, two capitals my @checkpoints = (); sub stack_checkpoints { my $val = shift; # Remove all checkpoint markers higher than $val @checkpoints = grep { $val>$_ } @checkpoints; print "Backtracking to $val\n"; push @checkpoints, $val; } print "Matched <$&> with ".@checkpoints." misses\n" if $str =~ /(?:p|.{0,1}?(?{stack_checkpoints 0})) (?:o|.{0,1}?(?{stack_checkpoints 1})) (?:l|.{0,1}?(?{stack_checkpoints 2})) (?:a|.{0,1}?(?{stack_checkpoints 3})) (?:r|.{0,1}?(?{stack_checkpoints 4})) (?:b|.{0,1}?(?{stack_checkpoints 5})) (?:e|.{0,1}?(?{stack_checkpoints 6})) (?:a|.{0,1}?(?{stack_checkpoints 7})) (?:r|.{0,1}?(?{stack_checkpoints 8})) /x;
    (quick update: made the alternatives non-greedy)

    We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: non-exact regexp matches
by vinforget (Beadle) on Jun 23, 2004 at 17:27 UTC
    I refined my question a little more. I have a string of letters [ACGTacgtNn] from which I want to find a particular instance of a regexp, let's say:
    /ACCAAC[ACGTacgtNn]{6}CTA[ACGTacgtNn]{1}ATG[ACGTacgtNn]{1,2}GATGTT/

    I can do this just fine, but what if I want to match the above regexp with a tolerance of 2 mismatches for single characters. Below I have an example:
    $buf =~ m/(A)(C)(C)(A)(A)(C)([ACGTacgtNn]{6})(CTA[ACGTacgtNn]{1})(A)(T +)(G)([ACGTacgtNn]{1,2})(G)(A)(T)(G)(T)(T)(?{ print $-[0]," ",scalar@-,"\n"; })(?!)/;
    this will print the position of the match in $buf, followed by 19 (the number of submatches). I want to be able to return a match from 17-19 submathes, not just all 19. Thanks. Vince
      I have a mechanism for you. Right now, it requires that you break your regex up into pieces yourself, but once I have Regexp::Parser completed, this mechanism will be available via Regexp::Parser::Fuzzy.

      It tries to be smart, making sure that when it does an "insert", it's not inserting the next thing it was supposed to match anyway (I don't think that breaks anything), and that when it does a "modify", it doesn't match the thing it was supposed to try to match.

      Also, right now, it just prints the matches. If you tell me this program does what you need it to do, then I'll help make it more useful. If the regex is something that you don't have control over (that is, it's user input), then you're going to need a regex parser to help you split it up...

      my $rx = mk_fuzzy(0, 1, 0, qw( p e r l )); "pearl" =~ $rx; # mk_fuzzy(MODs, INSs, DELs, parts...) sub mk_fuzzy { our ($m, $i, $d) = splice @_, 0, 3; use re 'eval'; qr{ (?{ [ $i, $d, $m ] }) ^ @{[ map qq{ (?: $_[$_] (?: | (?(?{ \$^R->[0] }) @{[ $_ < $#_ and "(?! $_[$_+1] + )" ]} (?s: . ) (?{ [ \$^R->[0] - 1, \$^R->[1], \$^R->[2] ] }) | (?!) + ) ) | (?(?{ \$^R->[1] }) (?{ [ \$^R->[0], \$^R->[1] - 1, \$^R->[2] + ] }) | (?!) ) | (?(?{ \$^R->[2] }) (?! $_[$_] ) (?s: . ) (?{ [ \$^R->[0], \$ +^R->[1], \$^R->[2] - 1 ] }) | (?!) ) ) }, 0 .. $#_ ]} $ (?{ printf ">> %s (M=%d/%d, I=%d/%d, D=%d/%d)\n", $&, $m-$^R->[2], + $m, $i-$^R->[0], $i, $d-$^R->[1], $d }) (?!) }x; }
      _____________________________________________________
      Jeff[japhy]Pinyan: Perl, regex, and perl hacker, who'd like a job (NYC-area)
      s++=END;++y(;-P)}y js++=;shajsj<++y(p-q)}?print:??;
        Good thing you used the /x modifier, or that regex would be hard to read!

        ;-)


        We're not really tightening our belts, it just feels that way because we're getting fatter.
Re: non-exact regexp matches
by bageler (Hermit) on Jun 23, 2004 at 18:57 UTC
    why does it need to be in a regexp? Sounds like you just want a string comparator, unless you have more complex patterns in mind.
    $str = 'PolarBear'; my @str = split//,$str; my @pats = qw(polarbear Polerbear Polarbeer); for (@pats) { my @mat = mismatches($_,$str); print "Mismatches for $_ vs $str: @mat\n"; } sub mismatches { my ($p1,$p2) = @_; my @foo = split//,$p1; my @str = split//,$p2; my @pos; if ($#foo > $#str) { for (0 .. $#foo) { push @pos, $_ if $foo[$_] ne $str[$_]; } } else { for (0 .. $#str) { push @pos, $_ if $foo[$_] ne $str[$_]; } } return @pos; }
      The problem is that I have a query string that is rather short, and a subject string that is long (~1 million chars) that can contain multiple nested matches.
Re: non-exact regexp matches
by wufnik (Friar) on Jun 23, 2004 at 18:24 UTC
    what you/we really want is an implementation of the below - an optimal way to approximately match regular expressions. why this rather than the others above?

    1: Bull Math Biol. 1989;51(1):5-37. Approximate matching of regular expressions. Myers EW, Miller W.


    none of the above are able to compare "edit distances" *for regular expressions* in the way the Text::Levenshtein etc allow the comparison of these edit distances for strings. instead, they quite effectively hardwire a greater degree of flexibility into the patterns that can be recognized. but to do this properly, you need to 'penalize' insertions/deletions in your regexp in the same way you do for sequences. the above paper outlines a way of doing this. as for implementation - I don't know.

    is there something around the BioPerl guys might know of?
    ...wufnik

    -- in the world of the mules there are no rules --
      What I want is a little simpler. I would just need to match the characters and not the character classes/intervals because spacing is deemed to be important in this case. I just want to allow for a certain number substitutions for the fixed characters. I will still read the paper... I may find something that will help me get to a partial solution. Thanks
      Vince
Re: non-exact regexp matches
by vinforget (Beadle) on Aug 06, 2004 at 15:29 UTC
    Just to keep the thread up-to-date, I have found the following interesting link. http://pauillac.inria.fr/algo/nicodeme/publications.html . Vince

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://369048]
Approved by Happy-the-monk
Front-paged by broquaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (4)
As of 2024-03-29 11:59 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found