PerlMonks  

How to test for sameness on a string of numbers

by willk1980 (Novice)
on Mar 21, 2013 at 02:38 UTC (#1024651=perlquestion)
willk1980 has asked for the wisdom of the Perl Monks concerning the following question:

Hi

I have some input from an external system that arrives in my Perl code as a string of digits.

Each string is a concatenation of 10-digit telephone numbers, and I want an easy way to check whether two strings contain the same set of 10-digit numbers, for a hash key I'm creating later.

So for instance if I have the following three 10 digit numbers

5125670001, 5125760002, 5125760003

They could appear as

512567001512567002512567003
512567002512567001512567003
512567003512567001512567002

etc.

In each case I want the strings to equate to the same checksum or equivalent because they each contain the same 3 numbers.

Now I could just split the string into its constituent parts, because I know the format, but is there a more creative way to achieve what I'm after? Effectively it's a sameness check: I want to take the output of the check and use it as the value for my hash key.

So for my hash lookup to work I need the 3 long strings in the example above to convert to the same value, so they each give the same hash key each time.

I was hoping there might be some suggestions on here?

Re: How to test for sameness on a string of numbers
by CountZero (Bishop) on Mar 21, 2013 at 06:55 UTC
    Unpack the string into its three parts, sort the parts, join them again, and compare the reassembled strings. This is a general algorithm: transform variable data into its canonical form and compare that.
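The unpack, sort, and join steps could be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name `canonical_key` and the `%seen` hash are made up for the example, and the inputs are assumed to be the corrected 30-digit strings (three 10-digit numbers each) rather than the 27-digit strings in the original question.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split into 10-character fields, sort them as
# strings, and rejoin. Assumes the input length is a multiple of 10.
sub canonical_key {
    my ($digits) = @_;
    my @parts = unpack '(A10)*', $digits;
    return join '', sort @parts;
}

# Assumed inputs: the same three numbers in three different orders.
my %seen;
for my $input (qw(
    512567000151256700025125670003
    512567000251256700015125670003
    512567000351256700015125670002
)) {
    $seen{ canonical_key($input) }++;
}

# All three inputs collapse to one canonical key.
printf "%d distinct key(s)\n", scalar keys %seen;
```

Because the canonical string is itself used as the hash key, no separate checksum step is needed for the lookup to work.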

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: How to test for sameness on a string of numbers
by kcott (Abbot) on Mar 21, 2013 at 06:58 UTC

    G'day willk1980,

    Your sample data contains groupings of 9-digit (not 10-digit) numbers:

    $ perl -Mstrict -Mwarnings -E '
        my @originals = (5125670001, 5125760002, 5125760003);
        my @inputs = qw{
            512567001512567002512567003
            512567002512567001512567003
            512567003512567001512567002
        };
        my $canon_original = join q{} => sort @originals;
        for (@inputs) {
            my $canon_input = join q{} => sort /(\d{10})/g;
            say "$_ is ", ($canon_input eq $canon_original) ? "same" : "not same";
        }
    '
    512567001512567002512567003 is not same
    512567002512567001512567003 is not same
    512567003512567001512567002 is not same

    Changing the 00[1-3] to 000[1-3] to make them 10-digit groupings:

    $ perl -Mstrict -Mwarnings -E '
        my @originals = (5125670001, 5125760002, 5125760003);
        my @inputs = qw{
            512567000151256700025125670003
            512567000251256700015125670003
            512567000351256700015125670002
        };
        my $canon_original = join q{} => sort @originals;
        for (@inputs) {
            my $canon_input = join q{} => sort /(\d{10})/g;
            say "$_ is ", ($canon_input eq $canon_original) ? "same" : "not same";
        }
    '
    512567000151256700025125670003 is not same
    512567000251256700015125670003 is not same
    512567000351256700015125670002 is not same

    Reversing the order of 67 in each instance of 5125670002 and 5125670003 to match the originals:

    $ perl -Mstrict -Mwarnings -E '
        my @originals = (5125670001, 5125760002, 5125760003);
        my @inputs = qw{
            512567000151257600025125760003
            512576000251256700015125760003
            512576000351256700015125760002
        };
        my $canon_original = join q{} => sort @originals;
        for (@inputs) {
            my $canon_input = join q{} => sort /(\d{10})/g;
            say "$_ is ", ($canon_input eq $canon_original) ? "same" : "not same";
        }
    '
    512567000151257600025125760003 is same
    512576000251256700015125760003 is same
    512576000351256700015125760002 is same

    -- Ken

Re: How to test for sameness on a string of numbers
by rjt (Deacon) on Mar 21, 2013 at 07:37 UTC

    Unless I'm completely missing your point, it looks like your sample strings do not contain the original phone numbers. The phone numbers are 10 digits, while the strings are 27. I'm going to assume that's a typo, and that the actual strings you're dealing with are concatenations of the three 10 digit numbers you listed, i.e.:

    512567000151256700025125670003
    512567000251256700015125670003
    512567000351256700015125670002

    If I'm misunderstanding you in some weird way, please let me know.

    By "sameness check", I'm guessing you want a hashing function that will hash the above three 30-character strings identically. That is, if the 10-digit numbers are $a, $b, and $c, the following 30-character strings should hash equivalently:

    abc, acb, bac, bca, cab, cba

    Finally, no other 30-character strings should hash to the same value.
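As a quick check of that requirement, the sketch below builds all six orderings of three numbers and confirms they reduce to a single key, while a string containing a different number reduces to a different one. The helper name `canonical_key` and the extra number `5125670004` are assumptions made for illustration; they do not appear in the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: split into 10-digit groups, sort, rejoin.
sub canonical_key {
    my ($digits) = @_;
    my @parts = $digits =~ /(\d{10})/g;
    return join '', sort @parts;
}

my ($n1, $n2, $n3) = qw(5125670001 5125670002 5125670003);

# The six orderings: abc, acb, bac, bca, cab, cba.
my @orderings = (
    [$n1, $n2, $n3], [$n1, $n3, $n2], [$n2, $n1, $n3],
    [$n2, $n3, $n1], [$n3, $n1, $n2], [$n3, $n2, $n1],
);

my %keys;
$keys{ canonical_key(join '', @$_) } = 1 for @orderings;

# A string with a different third number (assumed value).
my $other_key = canonical_key(join '', $n1, $n2, '5125670004');

printf "orderings produce %d key(s)\n", scalar keys %keys;
printf "different set collides: %s\n", exists $keys{$other_key} ? 'yes' : 'no';
```

Sorting fixed-width groups gives each set of numbers exactly one representation, so two strings share a key only when they contain the same groups.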

    If my interpretation of your requirements is correct, there's certainly more than one way to do it:

    #!/usr/bin/env perl
    use 5.014;
    use warnings;
    use Time::HiRes qw/time/;
    use Benchmark qw/cmpthese timethese/;
    use Inline 'C';

    sub hash_pack($) { join '', sort unpack '(A10)*', shift }
    sub hash_re($)   { join '', sort $_[0] =~ /(\d{10})/g }
    sub hash_substr($) {
        my @nums;
        my $s = shift;
        while ($s) {
            push @nums, substr($s,0,10);
            $s = substr($s,10);
        }
        join '',sort @nums;
    }

    # Only considers first 3 numbers
    sub hash_substr2($) {
        join '', sort substr($_[0],0,10),substr($_[0],10,10),substr($_[0],20,10);
    }

    my @funcs = map { "hash_$_" } qw/pack re substr substr2 c/;
    my @strings = qw/512567000151256700025125670003
                     512567000251256700015125670003
                     512567000351256700015125670002/;

    for my $s (@strings) {
        printf "%12s(%s) => %s\n", $_, $s, eval "$_(\$s)" for @funcs;
    }

    my $s = $strings[0];
    cmpthese timethese(-5, { map { $_ => "$_('$s')" } @funcs });

    __END__
    __C__

    /* Try our own splitter sort. This swaps the numbers in-place
     * as necessary to obtain a sorted order. */

    #include <string.h>
    #define SIZE 10
    #define strswap(s1,s2,size) { \
        int i; \
        for (i = 0; i < size; i++) { \
            s1[i] = s1[i] ^ s2[i]; \
            s2[i] = s1[i] ^ s2[i]; \
            s1[i] = s1[i] ^ s2[i]; \
        } \
    }

    char * hash_c(char *str) {
        char *n0 = str;
        char *n1 = str + SIZE;
        char *n2 = str + SIZE + SIZE;

        if (strncmp(n0, n1, SIZE) > 0) strswap(n0, n1, SIZE);
        if (strncmp(n1, n2, SIZE) > 0) strswap(n1, n2, SIZE);
        if (strncmp(n0, n1, SIZE) > 0) strswap(n0, n1, SIZE);

        return str;
    }

    Output

       hash_pack(512567000151256700025125670003) => 512567000151256700025125670003
         hash_re(512567000151256700025125670003) => 512567000151256700025125670003
     hash_substr(512567000151256700025125670003) => 512567000151256700025125670003
    hash_substr2(512567000151256700025125670003) => 512567000151256700025125670003
          hash_c(512567000151256700025125670003) => 512567000151256700025125670003
       hash_pack(512567000251256700015125670003) => 512567000151256700025125670003
         hash_re(512567000251256700015125670003) => 512567000151256700025125670003
     hash_substr(512567000251256700015125670003) => 512567000151256700025125670003
    hash_substr2(512567000251256700015125670003) => 512567000151256700025125670003
          hash_c(512567000151256700025125670003) => 512567000151256700025125670003
       hash_pack(512567000351256700015125670002) => 512567000151256700025125670003
         hash_re(512567000351256700015125670002) => 512567000151256700025125670003
     hash_substr(512567000351256700015125670002) => 512567000151256700025125670003
    hash_substr2(512567000351256700015125670002) => 512567000151256700025125670003
          hash_c(512567000151256700025125670003) => 512567000151256700025125670003
    Benchmark: running hash_c, hash_pack, hash_re, hash_substr, hash_substr2 for at least 5 CPU seconds...
          hash_c: 6 wallclock secs ( 5.71 usr + 0.00 sys = 5.71 CPU) @ 4606276.36/s (n=26301838)
       hash_pack: 6 wallclock secs ( 5.07 usr + 0.00 sys = 5.07 CPU) @ 646938.07/s (n=3279976)
         hash_re: 6 wallclock secs ( 5.03 usr + 0.00 sys = 5.03 CPU) @ 422000.20/s (n=2122661)
     hash_substr: 5 wallclock secs ( 5.04 usr + 0.00 sys = 5.04 CPU) @ 328204.96/s (n=1654153)
    hash_substr2: 4 wallclock secs ( 5.14 usr + 0.00 sys = 5.14 CPU) @ 965458.95/s (n=4962459)
                      Rate hash_substr hash_re hash_pack hash_substr2 hash_c
    hash_substr   328205/s          --    -22%      -49%         -66%   -93%
    hash_re       422000/s         29%      --      -35%         -56%   -91%
    hash_pack     646938/s         97%     53%        --         -33%   -86%
    hash_substr2  965459/s        194%    129%       49%           --   -79%
    hash_c       4606276/s       1303%    992%      612%         377%     --

    You'll need to decide for yourself which is more appealing, and how much performance you need to squeeze out of this function. The C solution might be overkill, or its 377% speed gain over the fastest pure-Perl solution, hash_substr2 (roughly 4.8 times the rate), might be just what you need.

    Input validation is left as an exercise to the reader.
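One possible shape for that validation is sketched below, assuming that a valid input is one or more complete 10-digit groups of ASCII digits. The function name `canonical_key_checked` and the choice to return undef for malformed input are illustrative assumptions, not something specified in the thread.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical validating wrapper. Returns the canonical key, or
# undef (in scalar context) when the input is not made up entirely
# of complete 10-digit groups.
sub canonical_key_checked {
    my ($digits) = @_;
    return unless defined $digits;
    return unless $digits =~ /\A(?:[0-9]{10})+\z/;
    my @parts = $digits =~ /[0-9]{10}/g;
    return join '', sort @parts;
}

# The second input is the 27-digit string from the original question,
# which is not a multiple of 10 digits long.
for my $input ('512567000251256700015125670003',
               '512567001512567002512567003') {
    my $key = canonical_key_checked($input);
    print "$input => ", (defined $key ? $key : 'rejected'), "\n";
}
```

A check like this would have flagged the 27-digit sample strings immediately, rather than silently producing a key from only the complete groups.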

      Hi,

      This is exactly what I was looking for. Thank you for the quick response.

      Apologies for the typos - it was a late night and I'd been staring at the screen too long.

      I'll probably stick to a pure Perl approach, although the C solution is definitely something to think about.

      Thank you again for the help.
      -Will
