Re: simple string comparison for efficiency

by GrandFather (Saint)
on May 28, 2009 at 21:05 UTC ([id://766750])


in reply to simple string comparison for efficiency

Tell us about the larger picture. There is likely to be room for improving whatever you are doing at present, or at least for us to target solutions to whatever the bigger problem is. Most gains in performance are not obtained by micro-optimizations such as you are asking for. Some tricks that will give huge performance gains in some situations will cripple the code in other situations.

That said, you can build masks then xor the strings to give something like the fast match you are looking for where the strings to be matched are fairly long. Consider:

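use strict;
use warnings;

my $strA = 'ATGNCNC';
my $strB = 'ATGACNN';
my $strC = 'TTGACNN';

print $strA, (match ($strA, $strB) ? ' eq' : ' ne'), " $strB\n";
print $strA, (match ($strA, $strC) ? ' eq' : ' ne'), " $strC\n";

sub match {
    my ($mask1, $mask2) = @_;
    my ($str1, $str2) = @_;

    # N becomes \x00 (wildcard); A, T, G and C all become \xFF.
    # The N byte must be \x00 rather than the character '0' (0x30),
    # otherwise a T opposite an N leaves a stray bit (0x1A & 0x30 = 0x10).
    $mask1 =~ tr/NATGC/\x00\xFF/;
    $mask2 =~ tr/NATGC/\x00\xFF/;
    $mask1 &= $mask2;    # \x00 wherever either string has an N
    $str1 ^= $str2;      # non-zero wherever the two strings differ
    $str1 &= $mask1;     # discard differences at N positions
    return $str1 !~ /[^\x00]/;
}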

Prints:

ATGNCNC eq ATGACNN
ATGNCNC ne TTGACNN

If you can cache the masks (say you were matching all strings against all others for example) then you get a greater gain.
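As a rough sketch of what caching the masks could look like for an all-against-all comparison: each mask is built once and then reused for every pairing. The helper names make_mask and match_cached are made up for this illustration, and the sketch assumes all strings have the same length and contain only A, T, G, C and N.

```perl
use strict;
use warnings;

# Hypothetical helper: build the mask for one string, once.
# N positions become \x00, base positions become \xFF.
sub make_mask {
    my ($str) = @_;
    (my $mask = $str) =~ tr/NATGC/\x00\xFF/;
    return $mask;
}

# Hypothetical helper: compare two strings using prebuilt masks.
# Returns true when the strings agree wherever neither has an N.
sub match_cached {
    my ($str1, $mask1, $str2, $mask2) = @_;
    my $diff = ($str1 ^ $str2) & $mask1 & $mask2;
    return $diff !~ /[^\x00]/;
}

my @strs  = qw(ATGNCNC ATGACNN TTGACNN);
my @masks = map { make_mask($_) } @strs;    # one mask per string

# Compare every pair without rebuilding any mask.
for my $i (0 .. $#strs - 1) {
    for my $j ($i + 1 .. $#strs) {
        my $ok = match_cached($strs[$i], $masks[$i], $strs[$j], $masks[$j]);
        print $strs[$i], ($ok ? ' eq ' : ' ne '), $strs[$j], "\n";
    }
}
```

With n strings this does n tr/// passes instead of two per comparison, which is where the extra gain would come from.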


True laziness is hard work

Replies are listed 'Best First'.
Re^2: simple string comparison for efficiency
by tybalt89 (Monsignor) on Nov 16, 2024 at 18:39 UTC

    I was wandering around SuperSearch looking for something else when I saw this and wondered "Are the xor's with 'N' different than the xor's with ACGT?". Turns out they are, and so there is no need for any of the masking in this post.

    #!/usr/bin/perl

    use strict; # https://perlmonks.org/?node_id=766743
    use warnings;

    my %thexor; # check for N mismatches different from non-N mismatches
    for my $x ( qw( A T G C N ) )
      {
      for my $y ( qw( A T G C N ) )
        {
        $x lt $y and $thexor{$x ^ $y} .= "$x$y ";
        }
      }
    use Data::Dump 'dd', 'pp';
    dd \%thexor; # yes they are

    # mismatch "\2\4\6\23\25\27"
    # match "\0\t\r\17\32"

    local $/ = '';
    while( <DATA> )
      {
      my ($x, $y) = split;
      my $bad = ($x ^ $y) =~ tr/\2\4\6\23\25\27//; # therefore this counts mismatches
      print "$x ^ $y => ", pp($x ^ $y), $bad ? ' FAIL' : ' ok', "\n";
      }

    __DATA__
    ATGNCNC
    ATGACNN

    ATGNCNC
    TTGNNNC

    Outputs:

    {
      "\2"  => "AC ",
      "\4"  => "CG ",
      "\6"  => "AG ",
      "\t"  => "GN ",
      "\r"  => "CN ",
      "\17" => "AN ",
      "\23" => "GT ",
      "\25" => "AT ",
      "\27" => "CT ",
      "\32" => "NT ",
    }
    ATGNCNC ^ ATGACNN => "\0\0\0\17\0\0\r" ok
    ATGNCNC ^ TTGNNNC => "\25\0\0\0\r\0\0" FAIL
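The same idea can be wrapped as a drop-in predicate. This is only a sketch: the names match_xor and count_mismatches are invented here, and it assumes equal-length strings made up of A, T, G, C and N only.

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions where two different
# bases face each other. \2 \4 \6 \23 \25 \27 are the XOR values
# of the six base-vs-base pairs; any XOR involving N is not in
# that set, so N positions are never counted.
sub count_mismatches {
    my ($x, $y) = @_;
    my $diff = $x ^ $y;
    return ($diff =~ tr/\2\4\6\23\25\27//);
}

# Hypothetical helper: true when the strings agree wherever
# neither of them has an N.
sub match_xor {
    my ($x, $y) = @_;
    return count_mismatches($x, $y) == 0;
}

for my $pair ( [qw(ATGNCNC ATGACNN)], [qw(ATGNCNC TTGNNNC)] ) {
    my ($x, $y) = @$pair;
    print "$x $y ", (match_xor($x, $y) ? 'ok' : 'FAIL'), "\n";
}
```

No mask strings are built at all, so there is nothing to cache; the cost per comparison is one string xor and one tr/// count.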
Re^2: simple string comparison for efficiency
by CaptainF (Initiate) on May 29, 2009 at 00:16 UTC
    GrandFather, your solution using bitwise operators would not have occurred to me, but was exactly what I needed. It solved the problem several orders of magnitude faster than my solution. Is there a simple way to extract the number of string positions where one or both strings had an 'N' from your code?

      Add the line:

      my $nCount = ($mask1 =~ tr/N//) + ($mask2 =~ tr/N//);

      as the third line of sub match.
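Note that the line above adds up the Ns in each string separately, so a position where both strings have an N is counted twice. If each such position should count once, one possible variation is to count the \x00 bytes in the combined mask instead. The sub below is a standalone sketch; n_positions is a made-up name, and it assumes equal-length strings over A, T, G, C and N.

```perl
use strict;
use warnings;

# Hypothetical helper: number of positions where $str1, $str2,
# or both have an N.
sub n_positions {
    my ($str1, $str2) = @_;

    # N becomes \x00, every base becomes \xFF
    (my $mask1 = $str1) =~ tr/NATGC/\x00\xFF/;
    (my $mask2 = $str2) =~ tr/NATGC/\x00\xFF/;

    # The AND is \x00 wherever at least one string has an N
    my $combined = $mask1 & $mask2;
    return ($combined =~ tr/\x00//);
}

print n_positions('ATGNCNC', 'ATGACNN'), "\n";    # prints 3
```

For the example pair the Ns sit at positions 4 and 6 in the first string and 6 and 7 in the second, so this gives 3 where the per-string sum gives 4.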


      True laziness is hard work
