Beefy Boxes and Bandwidth Generously Provided by pair Networks
The stupid question is the question not asked
 
PerlMonks  

string match using with an N in any position

by biobee07 (Novice)
on Nov 18, 2011 at 02:03 UTC ( #938725=perlquestion: print w/ replies, xml ) Need Help??
biobee07 has asked for the wisdom of the Perl Monks concerning the following question:

Hello perl monks and wisefolks,

I need some help with a regular expression I have a file that has the following strings that I need to match (Query strings)

file1: GCGAT, CACGT

The target strings are in file2, against which the query strings need to be matched GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

The condition for match is that: 1. Each of the query strings should be matched only in the beginning of the string 2. The query strings can have an N at any position which means for each query string eg. GCGAT we can have NCGAT,GNGAT,GCNAT,GCGNT,GCGAN. So any of these strings should be matched with the target strings. How do I make a regular expression that can contain all the 6 possibilities(includes the original string, GCGAT). I have the following code so far:

# reading in each query string (file1) into an array while($line1= <INP1>){ chomp($line1); push (@barcode,$line1); } foreach $code(@barcode){ my $filename = $code; open(OUT, ">$filename") || die "$!\n"; for my $data(keys %idhash){ # I have stored each of the query stri +ngs in a hash. The value of the hash contains the target strings. The + keys are 1,2,3... my $value = $idhash{$data}; chomp($code); if($value =~ m/^$code/) # HOW DO I WRITE A REGULAR EXPRESSION HERE TO +ALLOW ALL THE 6 COMBINATIONS TO BE MATCHED PER QUERY STRING. { print "$idhash{data}\n"; # where the query string matches the +target string print value of the hash } } }
Thanks in advance, biobee

Comment on string match using with an N in any position
Download Code
Re: string match using with an N in any position
by roboticus (Canon) on Nov 18, 2011 at 02:55 UTC

    biobee07:

    Update: My solution just wasn't as good as the others in this thread. It looks like BrowserUk has the best solution.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      /^[NG][NC][NG][NA][NT]/ will also match the string 'NNNNN'

      Thanks for replying to my thread -:)
Re: string match using with an N in any position
by jwkrahn (Monsignor) on Nov 18, 2011 at 02:57 UTC

    You could probably do something like this:

    # reading in each query string (file1) into an array my @barcode; while ( my $line1 = <INP1> ) { chomp $line1; my @temp = ( $line1, ( $line1 ) x length $line1 ); substr $temp[ $_ ], $_, 1, 'N' for 0 .. $#temp - 1; local $" = '|'; push @barcode, [ $line1, qr/@temp/ ]; } for my $code ( @barcode ) { my $filename = $code->[ 0 ]; open OUT, '>', $filename or die "Cannot open '$filename' because: +$!"; # I have stored each of the query strings in a hash. # The value of the hash contains the target strings. # The keys are 1,2,3... for my $value ( values %idhash ) { if ( $value =~ /^$code->[1]/ ) { # where the query string matches the target # string print value of the hash print "$value\n"; } } }
Re: string match using with an N in any position
by Marshall (Prior) on Nov 18, 2011 at 03:08 UTC
    There are a number of ways to proceed. I would consider a dynamic regex. Bascially write a little subroutine that writes the regex! /^$regex/ where $regex winds up being something like: "NCGAT|GNGAT|GCNAT|GCGNT|GCGAN".

    You can use substr or whatever in the code that lists out the combinations. When the regex engine compiles this automagically built regex, it can deal with it very efficiently even if there are a lot of "OR" terms because there isn't any backtracking or forward looking or anything fancy.

    So I am suggesting: think simple regex that is program generated. Oh, I don't see any reason for the search patterns to be in hash keyed by a sequential number, why not just have an array?

    Update: I saw BrowserUk's solution and I like it. I wasn't aware that you could calculate the XOR byte by byte of 2 strings in a single Perl operation. Cool. There are some approximate pattern matching algorithms (eg agrep) where this is useful and the manner in which BrowserUk is using the XOR function is apropos.

    Perl has made some very significant enhancements to the regex engine as of late. I am still on Perl 5.10. What I am wondering (I don't know the answer) is how sophisticated the regex compiler has become in 5.14 and what its optimization points are. For example if the data to matched against starts with 'C', then none of the optional patterns will ever match and all of the rest of the calculations become irrelevant - first character isn't a 'G' or 'N' so: stop! If it does that then the result is likely to be faster than the procedure of XOR'ing the strings together and going thru the resultant XOR produced string to count the 0's (fewer character by character operations get done on average vs a smarter "give up early" strategy..well maybe..).

    What I am suggesting is that if performance matters, benchmarking is definitely in order!

    I have used the above algorithm in other situations where for example certain character positions can be swapped and others not. The requirement described in the OP is at the lower limit of what this "write a program to write a program" approach can do (the regex is essentially a program that is compiled by the regex engine at run time). Next year I am going to attempt a performance increase on my regex writing code - I'll report back if I find something particularly stunning.

      Here's a snippet of code that implements what you describe:
      sub fuzzy { my $string = shift; my @alt; for(my $i = 0; $i < length($string); $i++) { my $alt = $string; substr($alt, $i, 1) = 'N'; push @alt, $alt; } return join '|', $string, @alt; # original too } print fuzzy('GCGAT');
      You can just do
      $re = fuzzy($string); if($value =~ /^($re)/) { ... }
      Note the parens for grouping.

      If there's only one value for $re during the run of the file, you can use /o (as in /^($re)/o) but in a modern perl I don't think that makes much difference.

        Yes, this is the idea!
        I was out of town and using a friend's laptop without Perl installed and I was reluctant to post "untested code". This idea applies also in much more complex situations. I have one function that can generate regex'es with 4-12 terms from very similar length "search for" strings due to the "rules".

        Anyway if the "rules" can be described as a regex generation algorithm, then I am suggesting in this thread an "additional tool to add to the Perl toolbox".

        This had the added benefit of when the end user thinks that something should have either matched or not matched, the test output of the module can print regex'es for test cases (mine does). And the terms are simple enough that a user familiar with Windows '.', and '*' wildcards can tell me some new case to add or delete.

        This "write the regex" code replaced what my working group came to call "the regex from hell" - so complicated that you might have to read Mastering Regular Expressions by Jeffrey-Friedl several times before you could understand it!

Re: string match using with an N in any position
by BrowserUk (Pope) on Nov 18, 2011 at 03:08 UTC

    Regexes aren't always the best solution to fuzzy matching:

    #! perl -slw use strict; my @queries = qw[ GCGAT CACGT ]; chomp( my @targets = <DATA> ); for my $q ( @queries ) { for my $t ( @targets ) { my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched >= ( length( $q ) -1 ) ) { print "$q matched $t"; } } } __DATA__ GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

    Gives:

    [ 2:16:30.49] c:\test>junk77 GCGAT matched GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGAT matched GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CACGT matched CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Hi BrowserUk, You code has worked perfectly for my problem!!! I agree that my problem was one that needed more than just regular expression. However I was struggling to get an elegant solution in perl. Could I please ask you to explain me these two lines:
      my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched >= ( length( $q ) -1 ) )

        Sure.

        During the first iteration, the first of those lines will come down to:

        my $matched = ( 'GCGAT' ^ 'GNGAT' ) =~ tr[\0][\0];

        In words, the first of the query strings is exclusive-Or'd (^) with the first 5 characters of the first target string.

        Where the aligned characters in those two string are the same, the result of xoring them will be zero. Where the two characters differ, the result will be non-zero:

        $xord = 'GCGAT' ^ 'GNGAT';; print unpack 'C*', $xord;; 0 13 0 0 0

        In order to know how many characters matched, all we need to do is count the zeros, which is efficiently done using the transliteration operator:

        $matched = ( 'GCGAT' ^ 'GNGAT' ) =~ tr[\0][\0];; print $matched;; 4

        The result of the first line is that $matched is assigned the number of characters that matched between the two strings.

        As you want all characters, except one in any position, to match, the second line tests the number that matched against the length of the query string minus 1:

        if( $matched >= ( length( $q ) -1 ) ) { ## we have a winner

        If you know all your query strings are always five characters, then you could hard code the value 4 instead of ( length( $q ) -1 ). Ditto for the usage of length in the substr, but it never hurts to be flexible if there is little cost involved in doing so.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

      The OP seems to want to match at the start of the target string with either

      • an exact match to the query string, or
      • a match to a string formed by replacing a single character anywhere in the query string with a single 'N'.
      The code of Re: string match using with an N in any position matches 'CACGT' against 'CBCGTNNN' ('B' vice 'N').

      >perl -wMstrict -le "my @queries = qw[ GCGAT CACGT ]; ;; my @targets = qw(GNGATNNN GCGANBBB CNCGTNNN CBCGTNNN ); ;; for my $q (@queries) { for my $t (@targets) { my $matched = ($q ^ substr($t, 0, length $q)) =~ tr[\0][\0]; if($matched >= (length($q) - 1)) { print qq{'$q' matched '$t'}; } } } " 'GCGAT' matched 'GNGATNNN' 'GCGAT' matched 'GCGANBBB' 'CACGT' matched 'CNCGTNNN' 'CACGT' matched 'CBCGTNNN'

      Here's a variation that avoids this (although the conditional logic is a bit obscure).

      >perl -wMstrict -le "use List::MoreUtils qw(uniq); ;; my @queries = qw(GCGAT CACGTT); ;; my $n_diff = join '', uniq map { sprintf '\x%02x', ord($_ ^ 'N') } map { split // } @queries ; $n_diff = eval qq{ sub { return \$_[0] =~ tr/$n_diff/$n_diff/; } }; ;; my @targets = qw( GNGATNNNHIT GCGANBBBHIT CNCGTTNNNHIT CACGTTNNNHIT CBCGTTNNNMISS CNNGTTNNNMISS NCACGTTNNNMISS ); ;; for my $q (@queries) { my $len_q = length $q; TARGET: for my $t (@targets) { my $mask = $q ^ substr $t, 0, $len_q; my $nulls = $mask =~ tr{\0}{\0}; next TARGET if $len_q > $nulls + 1 or $len_q > $nulls && $n_diff->($mask) != 1 ; print qq{'$q' matched '$t'}; } } " 'GCGAT' matched 'GNGATNNNHIT' 'GCGAT' matched 'GCGANBBBHIT' 'CACGTT' matched 'CNCGTTNNNHIT' 'CACGTT' matched 'CACGTTNNNHIT'
        The code of Re: string match using with an N in any position matches 'CACGT' against 'CBCGTNNN' ('B' vice 'N').

        Agreed. But in the genomic encoding scheme of things, the 'N' means 'aNy'. Whereas 'B' means 'any except A'.

        With my very limited understanding, 'N' therefore encompasses 'B' in as much as there is no mention in his post of excluding strings that have an 'A' in the wild-card position. Nor is there any mention in the post of the possibility of "targets"(*) ever contains 'B's in the relevant positions.

        In this case, the OP seems satisfied with the solution for his particular problem. I'll leave it up to him to know his data and problem domain.

        (*An unusual term in this context -- the wild-cards are usually in the query -- but whatever :)


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        If the possibility your describe can happen, I think this is a computationally simpler solution:

        #! perl -slw use strict; my @queries = qw[ GCGAT CACGT ]; chomp( my @targets = <DATA> ); for my $q ( @queries ) { for my $t ( @targets ) { my $matched = ( $q ^ substr( $t, 0, length( $q ) ) ) =~ tr[\0] +[\0]; if( $matched == length( $q ) or $matched == length( $q )-1 and substr( $t, 0, length( $q ) ) =~ tr[N][N] == 1 ) { print "$q matched $t"; } } } __DATA__ GNGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GCGANBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB +BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB CNCGTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN GBGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN +NNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: string match using with an N in any position
by salva (Monsignor) on Nov 18, 2011 at 07:34 UTC
    If the matching strings are all of the same length:
    my %match = map { $_ => 1 } qw(GCGAT CACGT); while (<>) { my $line = $_; s/N//; # remove the first N in the string print "match: $line" if $match{substr($_, 0, 5)} }
    otherwise, if they have different lengths:
    my $pattern = join('|', qw(GCGAT CACGT)); $pattern = qr/^(?:$pattern)/; while (<>) { my $line = $_; s/N//; # remove the first N in the string print "match: $line" if $_ =~ $pattern; }
Re: string match using with an N in any position
by JavaFan (Canon) on Nov 18, 2011 at 10:04 UTC
    Just be pragmatic:
    my $str = "GCGAT"; my @PATS = map { my $s = $str; substr $s, $_, 1, "N"; $s; } 0 .. length($str) - 1; push @PATS, $str; foreach my $str (@PATS) { if (index $value, $str) == 0) { print "Oh, joy, it's a match"; } }
    There's no reason to be clever. But do note the above code is untested.
Re: string match using with an N in any position
by pvaldes (Chaplain) on Nov 18, 2011 at 18:10 UTC

    Just one more, not necessarily better or quicker than the other solutions

    if($value =~ m/^(NCGAT|GNGAT|GCNAT|GCGNT|GCGAN|NACGT|CNCGT|CANGT|CACNT|CACGTN).*?$/){do something...}

    1. Each of the query strings should be matched only in the beginning of the string 2. The query strings can have an N (and ONLY ONE?) at any position which means for each query string eg. GCGAT we can have NCGAT,GNGAT,GCNAT,GCGNT,GCGAN.

      Why the ".*?$"?

        Just a matter of personal preferences. It will helps me to remember what I was doing in future reviews, but you can avoid it if you want (and gain a little speed).

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://938725]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (8)
As of 2014-08-20 06:04 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (105 votes), past polls