Beefy Boxes and Bandwidth Generously Provided by pair Networks vroom
"be consistent"
 
PerlMonks  

Re: string match using with an N in any position

by BrowserUk (Pope)
on Nov 18, 2011 at 03:08 UTC ( #938734=note: print w/ replies, xml ) Need Help??


in reply to string match using with an N in any position

Regexes aren't always the best solution to fuzzy matching:

#! perl -slw use strict; my @queries = qw[ GCGAT CACGT ]; chomp( my @targets = <DATA> ); for my $q ( @queries ) { for my $t ( @targets ) { my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched >= ( length( $q ) -1 ) ) { print "$q matched $t"; } } } __DATA__ GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Gives:

[ 2:16:30.49] c:\test>junk77 GCGAT matched GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGAT matched GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CACGT matched CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re: string match using with an N in any position
Select or Download Code
Re^2: string match using with an N in any position
by biobee07 (Novice) on Nov 18, 2011 at 04:31 UTC
    Hi BrowserUk, You code has worked perfectly for my problem!!! I agree that my problem was one that needed more than just regular expression. However I was struggling to get an elegant solution in perl. Could I please ask you to explain me these two lines:
    my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched >= ( length( $q ) -1 ) )

      Sure.

      During the first iteration, the first of those lines will come down to:

      my $matched = ( 'GCGAT' ^ 'GNGAT' ) =~ tr[\0][\0];

      In words, the first of the query strings is exclusive-Or'd (^) with the first 5 characters of the first target string.

      Where the aligned characters in those two string are the same, the result of xoring them will be zero. Where the two characters differ, the result will be non-zero:

      $xord = 'GCGAT' ^ 'GNGAT';; print unpack 'C*', $xord;; 0 13 0 0 0

      In order to know how many characters matched, all we need to do is count the zeros, which is efficiently done using the transliteration operator:

      $matched = ( 'GCGAT' ^ 'GNGAT' ) =~ tr[\0][\0];; print $matched;; 4

      The result of the first line is that $matched is assigned the number of characters that matched between the two strings.

      As you want all characters, except one in any position, to match, the second line tests the number that matched against the length of the query string minus 1:

      if( $matched >= ( length( $q ) -1 ) ) { ## we have a winner

      If you know all your query strings are always five characters, then you could hard code the value 4 instead of ( length( $q ) -1 ). Ditto for the usage of length in the substr, but it never hurts to be flexible if there is little cost involved in doing so.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: string match using with an N in any position
by AnomalousMonk (Monsignor) on Nov 18, 2011 at 22:54 UTC

    The OP seems to want to match at the start of the target string with either

    • an exact match to the query string, or
    • a match to a string formed by replacing a single character anywhere in the query string with a single 'N'.
    The code of Re: string match using with an N in any position matches 'CACGT' against 'CBCGTNNN' ('B' vice 'N').

    >perl -wMstrict -le "my @queries = qw[ GCGAT CACGT ]; ;; my @targets = qw(GNGATNNN GCGANBBB CNCGTNNN CBCGTNNN ); ;; for my $q (@queries) { for my $t (@targets) { my $matched = ($q ^ substr($t, 0, length $q)) =~ tr[\0][\0]; if($matched >= (length($q) - 1)) { print qq{'$q' matched '$t'}; } } } " 'GCGAT' matched 'GNGATNNN' 'GCGAT' matched 'GCGANBBB' 'CACGT' matched 'CNCGTNNN' 'CACGT' matched 'CBCGTNNN'

    Here's a variation that avoids this (although the conditional logic is a bit obscure).

    >perl -wMstrict -le "use List::MoreUtils qw(uniq); ;; my @queries = qw(GCGAT CACGTT); ;; my $n_diff = join '', uniq map { sprintf '\x%02x', ord($_ ^ 'N') } map { split // } @queries ; $n_diff = eval qq{ sub { return \$_[0] =~ tr/$n_diff/$n_diff/; } }; ;; my @targets = qw( GNGATNNNHIT GCGANBBBHIT CNCGTTNNNHIT CACGTTNNNHIT CBCGTTNNNMISS CNNGTTNNNMISS NCACGTTNNNMISS ); ;; for my $q (@queries) { my $len_q = length $q; TARGET: for my $t (@targets) { my $mask = $q ^ substr $t, 0, $len_q; my $nulls = $mask =~ tr{\0}{\0}; next TARGET if $len_q > $nulls + 1 or $len_q > $nulls && $n_diff->($mask) != 1 ; print qq{'$q' matched '$t'}; } } " 'GCGAT' matched 'GNGATNNNHIT' 'GCGAT' matched 'GCGANBBBHIT' 'CACGTT' matched 'CNCGTTNNNHIT' 'CACGTT' matched 'CACGTTNNNHIT'
      The code of Re: string match using with an N in any position matches 'CACGT' against 'CBCGTNNN' ('B' vice 'N').

      Agreed. But in the genomic encoding scheme of things, the 'N' means 'aNy'. Whereas 'B' means 'any except A'.

      With my very limited understanding, 'N' therefore encompasses 'B' in as much as there is no mention in his post of excluding strings that have an 'A' in the wild-card position. Nor is there any mention in the post of the possibility of "targets"(*) ever contains 'B's in the relevant positions.

      In this case, the OP seems satisfied with the solution for his particular problem. I'll leave it up to him to know his data and problem domain.

      (*An unusual term in this context -- the wild-cards are usually in the query -- but whatever :)


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        ... in ... genomic encoding ... 'N' means 'aNy' [w]hereas 'B' means 'any except A'.

        I did not know that.

        With my very limited understanding...

        Well, your understanding is nowhere near as limited as mine and, as you say, the OPer seems happy, so...

      If the possibility your describe can happen, I think this is a computationally simpler solution:

      #! perl -slw use strict; my @queries = qw[ GCGAT CACGT ]; chomp( my @targets = <DATA> ); for my $q ( @queries ) { for my $t ( @targets ) { my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched == length( $q ) or $matched == length( $q )-1 and substr( $t, 0, length( $q ) ) =~ tr[N][N] == 1 ) { print "$q matched $t"; } } } __DATA__ GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GBGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://938734]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others rifling through the Monastery: (5)
As of 2014-04-21 01:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    April first is:







    Results (489 votes), past polls