Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

searching for a string w/ a * in any single position?

by mdunnbass (Monk)
on Oct 05, 2007 at 21:16 UTC ( #642987=perlquestion: print w/replies, xml ) Need Help??
mdunnbass has asked for the wisdom of the Perl Monks concerning the following question:

Forgive me monks, for i have sinned. It has been many months since my last visit to the monastery... So, here's a question for anyone wiser than me (read: everybody here) - if I have a string, let's say $string = abcdef, and I want to search a textfile, I can do a simple m//

But, what if i want to be able to match any 5 out of the 6 characters in string? i.e., the following would all be desired matches to the pattern:

*{1}bcdef a*{1}cdef ab*{1}def abc*{1}ef abcd*{1}f abcde*{1}
but none of the following would match:
abdef*{1} adcbfe ab*{1}cde etc....

How would I set up that as a search pattern, or a regexp? And, of course, the $64,000 twist to the question becomes, if $string is user input, and rather than a strict 5/6 match, the user wants to define the stringency of the match (i.e., allow $n non-matching characters in $string, how could I do that?

I know that I could laboriously go through and search for each of the entries in that first code block above, but I'd prefer to automate this task, and the length of $string may regularly be 10-15 characters long, with up to 5 non-matching characters permitted per search. That becomes a large quantity of different combinations to throw in to the search... Any ideas on how to do it better?

Any and all help would be greatly appreciated,

Replies are listed 'Best First'.
Re: searching for a string w/ a * in any single position?
by blokhead (Monsignor) on Oct 05, 2007 at 22:59 UTC
    Looks like you want to compute the Hamming distance (number of character positions in which they disagree) between two words. Do you really really need a regex, or will another way of matching be ok?

    If you're not strongly committed to using a regex, use something like this (probably will do strange things for non-ascii character sets):

    sub hamming { my ($x1, $x2) = @_; return -1 if length($x1) != length($x2); (my $xor = $x1 ^ $x2) =~ tr/\x0//cd; } print hamming(@$_), $/ for [qw[ abcdef abccef ]], [qw[ abcdef abc ]], [qw[ abcdef abbbbf ]]; __END__ 1 -1 3
    However, you could still cram this hamming function inside of a regex, with some trickery/cheating:
    # untested my $target = "abcdef"; my $distance = 1; my $len = length $target; my $qr = qr/ \b(\w{$target})\b (?(??{ hamming($target,$1) < $distance + }) | (?!)) /x
    I'm in a bit of a rush at the moment, so my (?(cond)pattern) syntax is probably wrong. Also, it could be optimized greatly to avoid backtracking (say, with (?>pattern) to capture $1). Maybe someone can help me out here with the details, but this should give some idea..


Re: searching for a string w/ a * in any single position?
by GrandFather (Sage) on Oct 05, 2007 at 21:46 UTC

    Tell us about the bigger problem. Not your solution, but the problem it solves. For example, could it be that you are trying to search for a word that may be misspelled?

    Note that your *{1} syntax is not (Perl) regular expression syntax so it's not clear if you mean 'match any single character' or something else. Rather than an invented (or non-Perl regex) syntax, you should either use a correct Perl regex syntax or a sample of matching and mismatching strings (both would be even better).

    Perl is environmentally friendly - it saves trees
      The bigger problem... I have a program that allows users to search text files in FASTA format for an arbitrary number of strings of arbitrary length (motifs of DNA nucleotides to be specific). These strings are typically 6-15 characters long.

      I have it currently set up to perform the searches with m//g an so on, where I have converted the user input strings into a regexp, based on whether they input only A,C,G, or T, or whether they used standard degeneracies, which allow things like specifying 'R' to mean either 'A' or 'G', etc. So, if the user inputs 'ART', the search string is actually "A[R|A|G]T", and so on. In this context, specifying 'N' at any point is equivalent to [A|C|G|T|N].

      What I am looking to do now, is search for whatever user input string, where, in the 5/6 case of my initial example, any 1 character can match N, but the rest of the string must match exactly. Additionally, I am looking to make this an optional feature, not a given of every search, and I want to make the number of N adjustable.

      I hope that clarifies things a bit...


Re: searching for a string w/ a * in any single position?
by mwah (Hermit) on Oct 05, 2007 at 23:21 UTC
    mdunnbassBut, what if i want to be able to match
    any 5 out of the 6 characters in string? i.e., the following
    would all be desired matches to the pattern ...

    This sounds like something from the BLAST
    (Basic Local Alignment Search Tool) domain.

    There are many pages on this topic, one nice is here.
       The initial search is done for a word of 
       length "W" that scores at least "T" when 
       compared to the query using a given 
       substitution matrix.
    Maybe you tell us your problem domain ;-)

Re: searching for a string w/ a * in any single position?
by Cop on Oct 05, 2007 at 22:26 UTC

    The best solution is not to come up with a single regexp, but have a solution to generate those regexp's. A big benefit: the code will be reusable.

      I think the following will do the trick:

      my @chars = map quotemeta, split //, $word; my $re = join '|', map { local @_ = @chars; $_[$_] = '.'; join '', @_ } 0..$#chars;

      Using Regexp::Assemble should be faster:

      use Regexp::Assemble qw( ); my @chars = map quotemeta, split //, $word; my $ra = Regexp::Assemble->new(); for (0..$#chars) { local @_ = @chars; $_[$_] = '.'; $ra->add(join '', @_) } my $re = $ra->re();
        Generalizing this approach for Hamming distance > 1, using the combinations iterator from Iterating over combinations:
        sub gen_regex { my ($target, $n) = @_; my @chars = split //, $target; my $ra = Regexp::Assemble->new(); my $iter = combinations( $n => [ 0 .. $#chars ] ); while (my @c = $iter->()) { local @_ = @chars; $_[$_] = '.' for @c; $ra->add( join '', @_ ); } $ra->re(); }


Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://642987]
Approved by Corion
[stevieb]: I've finally added the ability for berrybrew to fetch the list of perls available directly from Strawberry's releases.json file. Instead of pulling from there on every single call, I've added a berrybrew fetch, so it's only updated..
[stevieb]: ...on request. There's some supporting work I need to do, as well as update the docs, but it's in the v1.12 branch if anyone wants to play with it...
[stevieb]: ...issue 62 will track what else needs to be done.

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2017-03-31 01:30 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (364 votes). Check out past polls.