Beefy Boxes and Bandwidth Generously Provided by pair Networks
No such thing as a small change
 
PerlMonks  

Fastest data structure compare?

by swartz (Beadle)
on May 20, 2009 at 18:54 UTC ( #765304=perlquestion: print w/ replies, xml ) Need Help??
swartz has asked for the wisdom of the Perl Monks concerning the following question:

I need to compare two arbitrary hashes as efficiently as possible. I don't need to know the differences between them, I just need to know if they are the same or not.

This is for the purpose of memoizing a constructor. I want new(%params1) and new(%params2) to return the same object iff %params1 = %params2 deeply. The Memoize module just joins the arguments with a string, it explicitly does not work with reference arguments.

Some options include

  • Modules like Data::Compare and List::Compare, but these seem to be Perl-based
  • Serializing with Storable and comparing the strings, but not sure if this is guaranteed to be canonical
Before trying to benchmark all of these for different parameter sets, I wanted to know if there was a standard (e.g. XS-based) solution.

Thanks! Jon

Comment on Fastest data structure compare?
Re: Fastest data structure compare?
by JavaFan (Canon) on May 20, 2009 at 19:18 UTC
    If it would just be the comparison between two hashes, then any serializing-based solution isn't going to be the fastest, unless almost all your comparisons will be between hashes that are the same. Fast solutions will terminate comparing as soon as a difference has been found.

    But you want more than that. Basically, you want to get a unique, reproducable, representation of your datastructure - after all, after you've found a hash that you've seen before, you want to return the same thing as before.

    I'd go for Data::Dumper or YAML.

    But I am curious. What kind of application are you making where the constructor is such a bottle neck you're are looking for the fastest way of serialization?

      "Fast solutions will terminate comparing as soon as a difference has been found." - good point. Hard to say what the common case will be.

      If you were going to use Data::Dumper or YAML, why wouldn't you use Storable - isn't that generally the fastest serializer?

      This is for CHI (http://cpan.uwinnipeg.ca/dist/CHI). There are instances when you want to grab a cache object for a particular namespace "from the ether", and have it automatically return the same object if it was already requested. Think of DBI->connect_cached.

        "Fast solutions will terminate comparing as soon as a difference has been found." - good point. Hard to say what the common case will be.
        I'd think it's totally irrelevant in your case. Or at least, it would totally irrelevant if I were to make something similar that what you describe you are making. I wouldn't go and compare every outstanding object to see whether it may match - if speed is important, I certainly don't want to do a linear search!
        If you were going to use Data::Dumper or YAML, why wouldn't you use Storable - isn't that generally the fastest serializer?
        Because I know Data::Dumper and YAML, and I don't know Storable very well. And I like YAML. As for what is the fastest, I would only consider that if I was convinced the constructor was that much of a bottleneck that it would matter.
Re: Fastest data structure compare?
by Herkum (Parson) on May 21, 2009 at 00:02 UTC

    You could always write your own serializer for use with Memoize.

    On the other hand, maybe you should consider how you object is being constructed if you have multiple layers of parameters in your object.

    I wrote some caching code and what it checked for was the primary key of the object. If there was no primary key, it was a new object. If there was a primary key, it would check the cache and return it if it was cached, otherwise I fetched it from the database.

    There is no reason for you to verify every single param to determine if you are working with the same object, pick something that should be unique for your object instead identical parameters.

Re: Fastest data structure compare?
by repellent (Priest) on May 21, 2009 at 00:15 UTC
    Have a look at Test::Deep::NoTest. I'd start there.

    JSON::XS is one of the fastest, if not the fastest serializer in Perl. Look for the $json->canonical([$enable]) method in the documentation.

    To really experiment with your data structures, go with Data::Dump::Streamer.
Re: Fastest data structure compare?
by punch_card_don (Curate) on May 21, 2009 at 00:20 UTC
    How about this for single-level hashes:
    my @array_1 = sort keys( %hash_1); my $key_string_1 = join('', @array_1); my @array_2 = sort keys( %hash_2); my $key_string_2 = join('', @array_2); if ($key_string_1 eq $key_string_2) { ...
    and you've got your strings of keys already.
Re: Fastest data structure compare?
by BrowserUk (Pope) on May 21, 2009 at 01:36 UTC

    The good thing about using stringified representations of hashes for caching purposes, is that it makes each lookup O(1) (with a largish constant).

    Whereas, if you used one of the apparently more efficient, short-circuiting concurrent traversal methods, you have to perform that traversal against every other item (hash) in your cache. And that's O(N) (with a smaller constant). The stringified method wins hands down

    One tip if you decide to write your own stringifier, (I'd use Storable::freeze() which is generally much faster than any of the to-text serialisers), prefix your stringified hash with the number of keys it contains. There's no point in comparing the whole stringified hashes if they have different numbers of keys.

    Update: Actually, you might as well throw the length of the stringified hash up front also. Something like:

    use Storable qw[ freeze ]; my %cache; sub memoised { my $href = shift; my $cacheKey = pack 'NN/a*', scalar keys %$href, freeze $href; return $cache{ $cacheKey } if exists $cache{ $cacheKey }; ## calculate value for href return $cache{ $cacheKey } = $value; }

    That should short-circuit any hash collision string compares very quickly in all but the most nearly identical cases. Mind you, you'd be devilishly unlucky to have two nearly identical stringified hashes hash to the same bucket.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fastest data structure compare?
by eserte (Deacon) on Oct 08, 2009 at 10:58 UTC
    Using Storable is problematic, because it's not guaranteed that semantically equivalent data is serialized in the same way. Here's an example playing with the internal utf-8 flag --- while Test::More's is_deeply and Data::Compare get it right, checking the serialized Storable data fails:
    #!/usr/bin/perl use Data::Compare qw(); use Storable qw(nfreeze); use Test::More qw(no_plan); my $data1 = ["f\xfcbar"]; my $data2 = [substr("f\xfcbar\x{0100}", 0, -1)]; is_deeply($data1, $data2, "is_deeply test"); ok(Data::Compare::Compare($data1, $data2), "Data::Compare"); ok(nfreeze($data1) eq nfreeze($data2), "storable serialized");
    I just benchmarked Test::More::is_deeply vs. Data::Compare and found that the latter is 3x faster for a data set which size is ~6MB as a storable-serialized file. This probably depends on the structure of the data set.
      Storable warns against this
      if you happen to use your numbers as strings between two freezing operations on the same data structures, you will get different results.

      There is no facility either to return all strings as utf8 sequences, or to attempt to convert utf8 data back to 8 bit and croak() if the conversion fails.

      It is easily avoided if you use encoding/Encode
        Encode may help in the utf8 case, but not in the case of different integer representation (as mentioned in the Storable manpage):
        #!/usr/bin/perl use Data::Compare qw(); use Storable qw(nfreeze); use Test::More qw(no_plan); my $data1 = { foobar => 1 }; my $data2 = { foobar => "1" }; $Storable::canonical = 1; is_deeply($data1, $data2, "is_deeply test"); ok(Data::Compare::Compare($data1, $data2), "Data::Compare"); ok(nfreeze($data1) eq nfreeze($data2), "storable serialized");
Re: Fastest data structure compare?
by Your Mother (Canon) on Oct 08, 2009 at 16:31 UTC

    As repellent noted, JSON::XS is *very* fast. For some data sizes (in some benchmarks) it beats Storable. Here's a sample. Note the examples show the main caveat, JSON cares whether things are numbers or strings.

    use JSON::XS; my %bingo = ( one => 1, two => 2, three => 5, ); my %bongo = ( three => 5, two => 2, one => 1, ); my %bango = ( one => "1", two => "2", three => "5", ); print "Is bingo bongo? ", encode_json(\%bingo) eq encode_json(\%bongo) ? "yes!\n" : "noes :( +\n"; print "Is bango bongo? ", encode_json(\%bango) eq encode_json(\%bongo) ? "yes!\n" : "noes :( +\n";
    Is bingo bongo? yes! Is bango bongo? noes :(

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://765304]
Approved by ww
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others cooling their heels in the Monastery: (11)
As of 2014-09-16 21:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (50 votes), past polls