Beefy Boxes and Bandwidth Generously Provided by pair Networks
Pathologically Eclectic Rubbish Lister
 
PerlMonks  

Comparison by position and value

by BrowserUk (Patriarch)
on Jan 02, 2005 at 09:32 UTC ( [id://418761]=perlquestion: print w/replies, xml ) Need Help??

BrowserUk has asked for the wisdom of the Perl Monks concerning the following question:

Given two strings of equal length that contain digits and some placeholder value, I want a fast way to determine if the pair are 'compatible'.

Eg. Given (where underscore represents the placeholder but this could be substituted by any other non-digit value that helped the algorithm)

_8__3__19 48____7__

These two strings are compatible because no single digit appears in both strings except where it appears in the same position (8)

Whereas these two would be incompatible

_8__3__19 4_8___7__

because the digit 8 appears in both, but at a different position. And these two are incompatible

_8__3__19 48_____7_

because the second last digit in both strings contains a different value.

I know how to do this with a loop and substr and a hash, but it is rather slow and I have a set N of M sets of these strings and I want to produce a set P, where each element of P is a string that is combines one string from as many of the N sets of strings as are compatible. Given the combinatorial nature of the problem, I need to make it as quick as possible.

I keep thinking that this can be done using string-wise boolean operations, but I cannot see how?

If there is a better way of structuring the data (eg. arrays of char rather than strings) that makes the process quicker or easier that's good to.


Examine what is said, not who speaks.
Silence betokens consent.
Love the truth but pardon error.

Replies are listed 'Best First'.
Re: Comparison by position and value
by ccn (Vicar) on Jan 02, 2005 at 11:29 UTC

    having spaces instead of underscores:

    #!/usr/bin/perl -wl use strict; sub is_compatible { local $_ = $_[0] ^ $_[1]; return not (/[\001-\017]/ or /([\020-\031]).*?\1/s); } print is_compatible (' 8 3 19', '48 7 ') ? 'yes' : 'no'; print is_compatible (' 8 3 19', '4 8 7 ') ? 'yes' : 'no'; print is_compatible (' 8 3 19', '48 7 ') ? 'yes' : 'no'; __END__ # output yes no no
Re: Comparison by position and value
by gaal (Parson) on Jan 02, 2005 at 15:41 UTC

    This is essentially the same as substr and a hash, but will run faster. Forgive the funky pseudocode.

    sub is_compatible { my T, B; foreach i from 0 .. stringlen { next if top[i] eq bottom[i] if top[i] is a digit return NOT COMPATIBLE if bottom[i] is a digit T |= top[i] if bottom[i] is a digit B |= bottom[i] } return T&B ? NOT COMPATIBLE : COMPATIBLE }

    Assumptions: your "digits" are weakly fewer in number than your integer1 width. If by digit you really do mean 0 .. 9, I think this obtains on all the machines Perl runs on. In which case you can also optimize "is a digit" with low-level ASCII checks. Also, you should probably only split each string once instead of seeking into it. If this isn't fast enough, this is a good candidate for inlining in c.

    1 More precisely, whatever does efficient bitwise arithmetic.

Re: Comparison by position and value
by aquarium (Curate) on Jan 02, 2005 at 10:23 UTC
    pseudocode: make each of the pair of strings an array step through first array, each position you find a underscore, put the underscore in the second array in same position. now step through second array, and in each position you find a underscore, put one at the same position in first array. join each array into a scalar and compare both scalars with string comparison operator (eq) the initial traversing of both arrays can be easily done in one loop. i just wrote it the (2 loop) way to more clearly illustrate.
    the hardest line to type correctly is: stty erase ^H
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 11:08 UTC
    sub compatible { my ($s1, $s2) = @_; y/_/\0/c for $s1, $s2; !(($s1 | $s2)=~y/\0//); } print(compatible("_8__3__19", "48_____7_") ? "true\n" : "false\n");

    Update: BrowserUK has pointed out that this is wrong. Indeed, it does not work if there is the same digit in both positions of the same string.

    Update: Corrected (I hope) code

    sub compatible { y/_/\477/c, y/_/\0/ for my @m = @_; !((($_[0] ^ $_[1]) & $m[0] & $m[1]) =~ y/\0//c); } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0 for [ qw[ _8__3__19 48____7__ ] ], # compat [ qw[ _8__3__19 4_8___7__ ] ], # compat [ qw[ _8__3__19 48_____7_ ] ]; # clash

    Update: Oh, I've got it wrong again. forget this.

      That doesn't work as is?

      #! perl -slw use strict; sub compatible { my ($s1, $s2) = @_; y/_/\0/c for $s1, $s2; !(($s1 | $s2)=~y/\0//); } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0 for [ qw[ _8__3__19 48____7__ ] ], # good [ qw[ _8__3__19 4_8___7__ ] ], # bad [ qw[ _8__3__19 48_____7_ ] ]; # bad __END__ [11:18:29.65] P:\test>test _8__3__19 v 48____7__ ? 0 _8__3__19 v 4_8___7__ ? 1 _8__3__19 v 48_____7_ ? 0

      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

        I don't see your problem:

        _8__3__19 v 48____7__ ? 0 ^ _8__3__19 v 4_8___7__ ? 1 _8__3__19 v 48_____7_ ? 0 ^ ^

        For these strings at least, the result is 0 iff there is a clash of two digits somewhere.

Re: Comparison by position and value
by jbrugger (Parson) on Jan 02, 2005 at 15:36 UTC
    The 'subcode' indeed would work, it looks something like:

    #!/usr/bin/perl use strict; sub compare ($$) { my ($a,$b) = @_; my $ok = 0; for (my $i=0; $i < scalar(@{$a}); $i++) { if ( @{$a}[$i] eq @{$b}[$i] && @{$a}[$i] ne "_" ) { #compatible... $ok=1; } elsif (@{$a}[$i] ne @{$b}[$i] && (@{$a}[$i] ne "_" && @{$b}[$i] ne "_") ) { # totally not compatible return 0; } } $ok; } my @a1 = qw(_ 8 _ _ 3 _ _ 1 9); my @a2 = qw(2 _ 8 _ _ _ 7 _ _); my @a3 = qw(4 8 _ _ _ _ _ 7 _); my @a4 = qw(4 8 _ _ _ _ 7 _ _); print ( "1: " . compare(\@a1, \@a2) ."\n" ); print ( "2: " . compare(\@a1, \@a3) ."\n" ); print ( "3: " . compare(\@a1, \@a4) ."\n" );
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 11:21 UTC
    sub compatible { my ($s1, $s2) = @_; $s1=~/\G_/gcs and $s2=~/./gcs or $s1=~/./gcs and $s2=~/\G_/gcs or return 0 until $s1=~/\G\z/ or $s2=~/\G\z/; 1; } print(compatible("_8__3__19", "4_8___7__") ? "true\n" : "false\n");

    Update: this is wrong too, for the same reason as my other reply. Wait, I'll try to post a correction.

    Update: Maybe this would work:

    sub compatible { my ($s1, $s2) = @_; my $t; $s1=~/\G_/gcs and $s2=~/./gcs or $s1=~/(.)/gcs and $t = $1, $s2=~/\G_/gcs or $s2=~/(.)/gcs and +$t eq $1 or return 0 until $s1=~/\G\z/ or $s2=~/\G\z/; 1; } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0 for [ qw[ _8__3__19 48____7__ ] ], # compat [ qw[ _8__3__19 4_8___7__ ] ], # compat [ qw[ _8__3__19 48_____7_ ] ]; # clash

    Update: this is wrong too.

    Update:

    sub compatible { my ($s1, $s2) = @_; my($c1, $c2, $d); { $s1 =~ /(.)/gs or return 1; $c1 = $1; $s2 =~ /(.)/gs or return 1; $c2 = $1; if ($c1 eq "_") { if ($c2 eq "_") { } else { vec($d, ord $c2, 1)++ and return; } } else { if ($c2 eq "_") { vec($d, ord $c1, 1)++ and return; } else { $c1 ne $c2 and return; vec($d, ord $c2, 1)++ + and return; } } redo; } } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) || 0 for [ qw[ _8__3__19 48____7__ ] ], # compat [ qw[ _8__3__19 4_2___7__ ] ], # compat [ qw[ _8__3__19 4_8___7__ ] ], # clash [ qw[ __8_3__19 48____7__ ] ], # clash [ qw[ __8_3__19 84____7__ ] ], # clash [ qw[ _8__3__19 48_____7_ ] ]; # clash
Re: Comparison by position and value
by ambrus (Abbot) on Jan 02, 2005 at 12:03 UTC
    sub compatible { my ($s1, $s2) = @_; my $m = length($s1) - 1; ($s1 . $s2) !~ /^.{0,$m}?([^_]).{$m}(?!\1)[^_]/; } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0 for [ qw[ _8__3__19 48____7__ ] ], # compat [ qw[ _8__3__19 4_8___7__ ] ], # compat [ qw[ _8__3__19 48_____7_ ] ]; # clash

    Update:And this is wrong too.

    Update:

    sub compatible { my ($s1, $s2) = @_; my $m = length($s1) - 1; my($n, $p) = ($m - 1, $m + 1); ($s1 . $s2) !~ /^.{0,$m}?([^_]).{$m}(?!\1)[^_]/ and ($s1 . $s2) !~ /^.{0,$m}?([^_])(?:.{0,$n}|.{$p,})\1/; } printf "%s v %s ? %s\n", @$_, compatible( @$_ ) ? 1 : 0 for [ qw[ _8__3__19 48____7__ ] ], # compat [ qw[ _8__3__19 4_8___7__ ] ], # clash [ qw[ __8_3__19 48____7__ ] ], # clash [ qw[ __8_3__19 84____7__ ] ], # clash [ qw[ _8__3__19 48_____7_ ] ]; # clash
Re: Comparison by position and value
by Aristotle (Chancellor) on Jan 02, 2005 at 23:40 UTC

    I keep thinking that this can be done using string-wise boolean operations, but I cannot see how?

    Like this:

    sub compatible { my( $l, $r ) = @_; # underscores are insignificant tr/_/\0/ for $l, $r; # cancel out identical values my $xor = $l ^ $r; # convert to bitmasks tr/\0/\377/c for $l, $r; my $mask = $l & $r; # masked chars must be identical return !1 if ( $xor & $mask ) =~ tr/\0//c; # and there may not be dupes of non-identical characters return 0 == grep { my $char = substr( $xor, $_, 1 ); $char ne "\0" and index( $xor, $char, $_ + 1 ) != -1 } 0 .. length( $xor ) - 1; }

    Test suite in the readmore.

    Makeshifts last the longest.

      Your code appears to fail on '__8_3__19' and '84____7__', passing this as compatible when it should not be as the digit 8 appears in both strings in different positions.

      It could be that in moving your function into my test script, I have broken it, but as your post doesn't contain a stand-alone testcase that I can run, I cannot verify this conjecture.


      Examine what is said, not who speaks.
      Silence betokens consent.
      Love the truth but pardon error.

        Are you sure it's my posted code? I just dropped your example into my test suite as an incompatible case and it passes.

        Makeshifts last the longest.

Re: Comparison by position and value
by steves (Curate) on Jan 02, 2005 at 16:54 UTC

    Update: I didn't read closely enough. This doesn't account for the case where the same digit appears in different positions in each sequence -- i.e., I'm only comparing each individual position.

    Update 2: Here's a more complicated version that accounts for what I missed, the assumption still being that a given digit can only appear once in a sequence.

    sub compatible { my ($s1, $s2) = @_; my %hashed; my $diff = 0; my ($one, $two); my @counts; my $n = 10; # # Hash so that each $s1 value is a key with its corresponding # $s2 value as the hashed value. # @hashed{map {($_ eq '_') ? $n++ : $_} split("", $s1)} = map {($_ eq '_') ? $n++ : $_} split("", $s2); # # Compare each $s1->$s2 pair # while (($one, $two) = each %hashed) { next if ($one == $two); last if (($diff = (++$counts[$one] - 1)) > 0); last if (($diff = (++$counts[$two] - 1)) > 0); $diff = 0; next if ( ($one >= 10) || ($two >= 10) ); last if (($diff = ($one - $two)) != 0); } print "$s1\n$s2\n", ($diff == 0) ? "compatible" : "incompatible", +"\n\n"; }
    Revised output:
    _8__3__19 48____7__ compatible _8__3__19 4_2___7__ compatible _8__3__19 4_8___7__ incompatible __8_3__19 48____7__ incompatible __8_3__19 84____7__ incompatible _8__3__19 48_____7_ incompatible

    I'm not sure how fast this is compared to the other methods, but here's a way to do it using a hash and a short circuited loop comparison of each hashed key/value:

    use strict; sub compatible { my ($s1, $s2) = @_; my %hashed; my $diff = 0; my ($one, $two); # # Hash so that each $s1 value is a key with its corresponding # $s2 value as the hashed value. # @hashed{split("", $s1)} = split("", $s2); # # Compare each $s1->$s2 pair # while (($one, $two) = each %hashed) { next if ( ($one eq '_') || ($two eq '_') ); last if (($diff = ($one - $two)) != 0); } print "$s1\n$s2\n", ($diff == 0) ? "compatible" : "incompatible", +"\n\n"; } my @tests = (qw/ _8__3__19 48____7__ _8__3__19 4_2___7__ _8__3__19 4_8___7__ __8_3__19 48____7__ __8_3__19 84____7__ _8__3__19 48_____7_ / ); my ($s1, $s2); while (defined($s1 = shift(@tests))) { $s2 = shift(@tests); compatible($s1, $s2); }
    Output:
    _8__3__19 48____7__ compatible _8__3__19 4_2___7__ compatible _8__3__19 4_8___7__ compatible __8_3__19 48____7__ compatible __8_3__19 84____7__ compatible _8__3__19 48_____7_ incompatible
      I thought the following are supposed to be incompatible becuase the digit "8" occurs in both strings but at different locations:
      _8__3__19 4_8___7__ compatible __8_3__19 48____7__ compatible __8_3__19 84____7__ compatible
Re: Comparison by position and value
by holli (Abbot) on Jan 02, 2005 at 19:56 UTC
    that was too tough for me, so i asked a friend of mine (who is a real perl wizard but not very talkative). my own solution was hash-based too. he gave me this:
    use strict; use warnings; print &compatible ("_8__3__19", "48____7__"); #c print &compatible ("_8__3__19", "4_8___7__"); #i print &compatible ("_8__3__19", "48_____7_"); #i sub compatible { my @s=($_[0], $_[1]); my @d; for( 0,1 ) { for( $d[$_]=$s[$_] ) { tr/0-9/\0/c; #assuming the placeholder is not \0 tr/\0/\377/c; } } my $m="$d[0]" & "$d[1]"; for( 0,1 ) { $d[$_] = "$m" & "$s[$_]"; } my $compatible; if( $compatible = $d[0] eq $d[1] ) { for( 0,1 ) { $d[$_] = {}; ${$d[$_]}{$1} .= "$-[0] " while $s[$_] =~ /(\d)/g; } $compatible &= !grep$d[1]->{$_} && $d[1]->{$_} ne $d[0]->{$_}, +keys %{$d[0]}; } return $compatible ? 1 : 0;; }
    i canīt say i understand it fully but normally you can rely on ozo
Re: Comparison by position and value
by sgifford (Prior) on Jan 02, 2005 at 20:35 UTC

    It's the need to check the same digit appearing in different places in the two strings that makes this hard to do with bitwise string operations. But the only reason that bitwise string operations are so fast is because you can avoid making all of those jumps into the Perl interpreter; you can hand it both the strings and an operation, let it groove along in C for awhile, then come back with an answer.

    You can get the same effect with Inline::C.

    With a simple substr-based implementation, a C version is nearly 4 times faster than a nearly identical Perl version:

    Benchmark: timing 100000 iterations of csimple3, simple3... csimple3: 5 wallclock secs ( 4.43 usr + 0.00 sys = 4.43 CPU) @ 22573.36/s (n=100000) simple3: 19 wallclock secs (16.90 usr + 0.01 sys = 16.91 CPU) @ 5913.66/s (n=100000)
    Here's the code I used:
Re: Comparison by position and value
by jdalbec (Deacon) on Jan 03, 2005 at 00:56 UTC
    Update: modified output to display only the original strings as suggested by steves.
    Whereas these two would be incompatible
    _8__3__19 4_8___7__
    because the digit 8 appears in both, but at a different position. And these two are incompatible
    _8__3__19 48_____7_
    because the second last digit in both strings contains a different value.
    The second incompatibility is simple to test for using boolean operations. If we look at the strings as partial permutations, the first incompatibility is equivalent to the second incompatibility on the inverse partial permutations. So maybe a Schwartzian transform is in order. Are all your strings 9 characters long? Are all your digits the numbers 1..9? If so, maybe the following will help:
    sub appendinverse { my $string = shift; my @revarray; for my $i (0..8) { $revarray[0+substr($string, $i, 1)] = $i; } delete $revarray[0]; for my $i (1..9) { if(exists $revarray[$i]) { $string .= $revarray[$i]; } else { $string .= "_"; } } return $string; } # test harness stolen from steves my @tests = (qw/ _8__3__19 48____7__ _8__3__19 4_2___7__ _8__3__19 4_8___7__ __8_3__19 48____7__ __8_3__19 84____7__ _8__3__19 48_____7_ / ); # Schwartzian transform for my $i (@tests) { $i = appendinverse $i; # print "$i\n"; } sub compatible { my $a = shift; my $b = shift; # modified as suggested by steves # print "\n$a\n$b\n"; print "\n",substr($a,0,9),"\n",substr($b,0,9),"\n"; if (($a^$b)=~/[\001-\017]/) { print "incompatible\n"; } else { print "compatible\n"; } } # test harness stolen from steves my ($s1, $s2); while (defined($s1 = shift(@tests))) { $s2 = shift(@tests); compatible($s1, $s2); }
    which outputs:
    _8__3__197_4____18 48____7_____0__61_ compatible _8__3__197_4____18 4_2___7___2_0__6__ compatible _8__3__197_4____18 4_8___7_____0__62_ incompatible __8_3__197_4____28 48____7_____0__61_ incompatible __8_3__197_4____28 84____7_____1__60_ incompatible _8__3__197_4____18 48_____7____0__71_ incompatible
    _8__3__19 48____7__ compatible _8__3__19 4_2___7__ compatible _8__3__19 4_8___7__ incompatible __8_3__19 48____7__ incompatible __8_3__19 84____7__ incompatible _8__3__19 48_____7_ incompatible

      Output is a little misleading since it shows the transformed sequences -- not the originals.

Re: Comparison by position and value
by sgifford (Prior) on Jan 04, 2005 at 08:39 UTC

    Here's a solution that is very fast and manages to use string-wise boolean operations. It uses a data structure that's a little bit complicated, but since converting all the data structures is linear to the number of strings you have and you're doing the comparisons many more times than that, I think it will still offer a significant speedup.

    The basic idea is this:

    It's checking whether a character appears elsewhere in the string that's slow; there's not a straightforward way to do it without looking at each character individually. But if we have a string telling at which position each character occurs, we can use a similar masking technique to see whether the same character occurs in different positions in two strings. We can get this by doing something similar to transposing a one-dimensional matrix.

    Let me illustrate with an example.

    Since we can compare the strings and the "transposed" strings the same way, we can simply concatenate them together and store them with their masks for the data structure. I used an arrayref with [$orig_str, $str_and_xposed, $mask], with $str_and_xposed having underscores changed to \0.

    With this data structure, the entire test becomes:

    # a and mask[b] eq b and mask[a] ($_[0][1] & $_[1][2]) eq ($_[1][1] & $_[0][2]);

    Here's how it benchmarks. simple3 is a simple substr-based implementation, csimple3 is an Inline::C implementation of simple3, clever2 is the above code, clever3 is the same code but doing the data structure transformations beforehand, cclever3 is an Inline::C implementation of clever3. I've actually benchmarked several other promising solutions in this thread, and this is by far the fastest.

    Benchmark: timing 25000 iterations of cclever3, clever2, clever3, csim +ple3, simple3... cclever3: 1 wallclock secs ( 0.95 usr + 0.00 sys = 0.95 CPU) @ 26315.79/s (n=25000) clever2: 10 wallclock secs ( 8.73 usr + 0.02 sys = 8.75 CPU) @ 2857.14/s (n=25000) clever3: 1 wallclock secs ( 1.15 usr + 0.00 sys = 1.15 CPU) @ 21739.13/s (n=25000) csimple3: 1 wallclock secs ( 1.04 usr + -0.01 sys = 1.03 CPU) @ 24271.84/s (n=25000) simple3: 5 wallclock secs ( 3.97 usr + 0.01 sys = 3.98 CPU) @ 6281.41/s (n=25000)

    Here's the code I ran:

Re: Comparison by position and value
by !1 (Hermit) on Jan 02, 2005 at 21:42 UTC

    I kinda doubt it's fast but it certainly is short.

    sub compat { my ($f,$s) = @_; my $neg = "[^".join("",grep $_ ne "_", split//, $f)."]"; $f =~ s/((.)\1*)/$1 eq "_" ? "$neg\{".length($2)."}":"[$1_]"/ge; return $s =~ $f; }
Re: Comparison by position and value
by sgifford (Prior) on Jan 04, 2005 at 09:02 UTC

    For some reason I'm really enjoying this problem.

    I've benchmarked all of the promising-looking solutions here. If you don't see yours and you think it's a contender, let me know, and ideally post your benchmark code and results as a followup to this (or stick it in a scratchpad and /msg me and I'll add it).

    Some of mine use Inline::C; if you don't have it and don't want it, just comment out the multi-line use statement and the Init statement, and remove the benchmarks for sgifford_csimple3 and sgifford_cclever3.

    Results (slightly doctored* for better display):

    ambrus: 3s ( 3.19 usr + 0.01 sys = 3.20 CPU) @ 7812.50/s aristotle: 7s ( 5.71 usr + 0.02 sys = 5.73 CPU) @ 4363.00/s aristotle2: 5s ( 4.56 usr + 0.00 sys = 4.56 CPU) @ 5482.46/s ccn: 2s ( 1.74 usr + 0.00 sys = 1.74 CPU) @ 14367.82/s sgifford_cclever3: 1s ( 0.96 usr + 0.00 sys = 0.96 CPU) @ 26041.67/s sgifford_clever2: 9s ( 8.56 usr + 0.05 sys = 8.61 CPU) @ 2903.60/s sgifford_clever3: 2s ( 1.19 usr + 0.01 sys = 1.20 CPU) @ 20833.33/s sgifford_csimple3: 1s ( 1.04 usr + 0.00 sys = 1.04 CPU) @ 24038.46/s simple3: 5s ( 4.23 usr + 0.01 sys = 4.24 CPU) @ 5896.23/s

    Code follows.

    *Benchmark Doctoring Code:

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://418761]
Approved by ccn
Front-paged by ccn
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (6)
As of 2024-03-19 10:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found