Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses

Re: string match using with an N in any position

by Marshall (Abbot)
on Nov 18, 2011 at 03:08 UTC ( #938733=note: print w/replies, xml ) Need Help??

in reply to string match using with an N in any position

There are a number of ways to proceed. I would consider a dynamic regex. Bascially write a little subroutine that writes the regex! /^$regex/ where $regex winds up being something like: "NCGAT|GNGAT|GCNAT|GCGNT|GCGAN".

You can use substr or whatever in the code that lists out the combinations. When the regex engine compiles this automagically built regex, it can deal with it very efficiently even if there are a lot of "OR" terms because there isn't any backtracking or forward looking or anything fancy.

So I am suggesting: think simple regex that is program generated. Oh, I don't see any reason for the search patterns to be in hash keyed by a sequential number, why not just have an array?

Update: I saw BrowserUk's solution and I like it. I wasn't aware that you could calculate the XOR byte by byte of 2 strings in a single Perl operation. Cool. There are some approximate pattern matching algorithms (eg agrep) where this is useful and the manner in which BrowserUk is using the XOR function is apropos.

Perl has made some very significant enhancements to the regex engine as of late. I am still on Perl 5.10. What I am wondering (I don't know the answer) is how sophisticated the regex compiler has become in 5.14 and what its optimization points are. For example if the data to matched against starts with 'C', then none of the optional patterns will ever match and all of the rest of the calculations become irrelevant - first character isn't a 'G' or 'N' so: stop! If it does that then the result is likely to be faster than the procedure of XOR'ing the strings together and going thru the resultant XOR produced string to count the 0's (fewer character by character operations get done on average vs a smarter "give up early" strategy..well maybe..).

What I am suggesting is that if performance matters, benchmarking is definitely in order!

I have used the above algorithm in other situations where for example certain character positions can be swapped and others not. The requirement described in the OP is at the lower limit of what this "write a program to write a program" approach can do (the regex is essentially a program that is compiled by the regex engine at run time). Next year I am going to attempt a performance increase on my regex writing code - I'll report back if I find something particularly stunning.

  • Comment on Re: string match using with an N in any position

Replies are listed 'Best First'.
Re^2: string match using with an N in any position
by bart (Canon) on Nov 18, 2011 at 11:55 UTC
    Here's a snippet of code that implements what you describe:
    sub fuzzy { my $string = shift; my @alt; for(my $i = 0; $i < length($string); $i++) { my $alt = $string; substr($alt, $i, 1) = 'N'; push @alt, $alt; } return join '|', $string, @alt; # original too } print fuzzy('GCGAT');
    You can just do
    $re = fuzzy($string); if($value =~ /^($re)/) { ... }
    Note the parens for grouping.

    If there's only one value for $re during the run of the file, you can use /o (as in /^($re)/o) but in a modern perl I don't think that makes much difference.

      Yes, this is the idea!
      I was out of town and using a friend's laptop without Perl installed and I was reluctant to post "untested code". This idea applies also in much more complex situations. I have one function that can generate regex'es with 4-12 terms from very similar length "search for" strings due to the "rules".

      Anyway if the "rules" can be described as a regex generation algorithm, then I am suggesting in this thread an "additional tool to add to the Perl toolbox".

      This had the added benefit of when the end user thinks that something should have either matched or not matched, the test output of the module can print regex'es for test cases (mine does). And the terms are simple enough that a user familiar with Windows '.', and '*' wildcards can tell me some new case to add or delete.

      This "write the regex" code replaced what my working group came to call "the regex from hell" - so complicated that you might have to read Mastering Regular Expressions by Jeffrey-Friedl several times before you could understand it!

        Here's another idea: Regexp::Assemble is a module that can combine regexps — and possibly optimize the result. Granted, nowadays perl already optimizes large regexps, so it's not as useful any more as it once was.

        Turning a glob pattern into a regexp isn't that hard: you can already get far by replacing every "?" with ".", "*" with ".*", and quotemeta everything else.

        %s = ( '?' => '.', '*' => '.*' ); s/(\W)/$s{$1} || "\\$1"/ge;

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://938733]
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others taking refuge in the Monastery: (5)
As of 2018-04-20 05:46 GMT
Find Nodes?
    Voting Booth?