http://www.perlmonks.org?node_id=906081


in reply to Using hashes for set operations...

I like the idea of starting with a master hash that gives the stringification of each complex data type. Making that the basic unit for manipulation, rather than a separate internment step, is interesting.

But your simple elegant code is relying on implicit stringification, and repeated stringification of the same values. This is inefficient if it's a complex data structure, and needs an explicit call in the general code to handle ad-hoc data structures that are not blessed into a class with a stringify operator that's suitable for this purpose.

As for getting a fake "" key you need to toss out, I don't think that's a problem if you make sure that the stringification never returns "" for a valid item in the set.

Ultimately, I think the answer to "how do I find an intersection" isn't to repeat some incantation invented by someone else, but to simply call intersection. The algorithm should be canned in a CPAN module.

Re^2: Using hashes for set operations...
by LanX (Saint) on May 23, 2011 at 09:11 UTC
    > But your simple elegant code is relying on implicit stringification, and repeated stringification of the same values. This is inefficient if it's a complex data structure, and needs an explicit call in the general code to handle ad-hoc data structures that are not blessed into a class with a stringify operator that's suitable for this purpose.

    In Perl, stringification of references has no performance penalty. It's just a string with the reference address¹ and the type (this includes the package name if blessed).

    perl -e '$p=[];print $p'
    ARRAY(0x928c880)

    (I think you are confusing this with other languages like JS², where the whole data is dumped.)
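A small sketch of what that stringification looks like in practice, assuming only the core Scalar::Util module. The class name Employee and the variable names are made up for illustration, and the printed addresses differ on every run.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

my $plain   = [1, 2, 3];
my $copy    = [1, 2, 3];              # same contents, different reference
my $blessed = bless {}, 'Employee';   # hypothetical class name

# Unblessed: TYPE(0xADDRESS). Blessed: Package=TYPE(0xADDRESS).
print "$plain\n";
print "$blessed\n";

# The hex digits in the string are the address that refaddr() reports.
my ($hex) = "$plain" =~ /\(0x([0-9a-f]+)\)/;
my $same_address = $hex eq sprintf('%x', refaddr($plain)) ? 1 : 0;

# Used as hash keys, two separately built copies stay distinct.
my %seen = ($plain => 1);
my $copy_found = exists $seen{$copy} ? 1 : 0;

print "same address: $same_address, copy found: $copy_found\n";
```

No data is walked to build the key, which is why the cost does not grow with the size of the structure.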

    OTOH, there is a doubled memory consumption when using primitive scalars like long strings.

    > As for getting a fake "" key you need to toss out, I don't think that's a problem if you make sure that the stringification never returns "" for a valid item in the set.

    Well, I think special-case handling of "" wouldn't cost too much performance... it's just not elegant anymore.

    > Ultimately, I think the answer to "how do I find an intersection" isn't to repeat some incantation invented by someone else, but to simply call intersection. The algorithm should be canned in a CPAN module.

    There is already Set::Object which seems to follow the same ideas, but it looks heavy and uses inline C...

    Anyway, my point was first to improve the FAQ and then to think about a module.

    Cheers Rolf

    1) Of course one should be careful when mixing strings and references in one set... :)

    2) e.g. javascript:a=[1,[2,3]];alert(a) shows 1,2,3

      > In Perl stringification of references has no performance penalty. It's just a string with the reference address¹ and the type (this includes package name if blessed).
      > perl -e '$p=[];print $p'
      > ARRAY(0x928c880)
      > (I think you are confusing with other language like JS², where the whole data is dumped)

      No, I'm thinking that it is pointless to compare references since any two copies will test as unequal. Instead, you must manually write something that stringifies (or hashes, in the other sense of the word) to a canonical form in order to then test for equivalence.

      I guess that depends on what the user intends, so the FAQ should point out that using a reference (or object) as a hash key will stringify it as you show, and so performs the same test as == against the reference itself (the address).

      I think intersection and friends should be like sort, in that they can take a piece of code that is used to determine what is meant by equivalence in this particular case. That's easy to call but can be inefficient; and just like you use the whatever maneuver with sort to cache the keys, you could do the same with intersection. But the eventual module can have that built-in, as your ideas directly incorporate that kind of keying. Then the user needs to provide code to produce a canonical key of one item, as opposed to comparing the equivalence of two parameters.

      But back to the underlying code: If I want two ad-hoc uses of [qw/1 2 3/] to be considered the same, stringifying the reference won't do it. It needs to call a function to generate the string key from the contents. And we suppose that this is expensive, so only call it once per value in each input list.
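One way such a key function could look, as a sketch only. The name canonical_key is hypothetical, and the encoding (type tags plus length-prefixed scalars) is just one choice that keeps different structures from colliding.

```perl
use strict;
use warnings;

# Hypothetical helper: build a string key from the contents of a structure,
# so that two separately built copies get the same key.
sub canonical_key {
    my ($item) = @_;
    my $type = ref $item;

    if ($type eq 'ARRAY') {
        return 'A[' . join(',', map { canonical_key($_) } @$item) . ']';
    }
    if ($type eq 'HASH') {
        # Sort the keys so hash ordering does not affect the result.
        return 'H{' . join(',',
            map { canonical_key($_) . '=' . canonical_key($item->{$_}) }
            sort keys %$item) . '}';
    }
    return 'U' unless defined $item;

    # Plain scalars (and any other reference type, which falls back to its
    # address string) are length-prefixed to keep the encoding unambiguous.
    return 'S' . length($item) . ':' . $item;
}

my $first  = [qw/1 2 3/];
my $second = [1, 2, 3];

print "refs equal\n" if $first eq $second;                                # not printed
print "keys equal\n" if canonical_key($first) eq canonical_key($second);  # printed
```

The function walks the whole structure, so it is exactly the kind of expensive call that should run only once per value.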

      The user wants to find the intersection of two lists, so he would be told to pass @set1 and @set2, and optionally a &func, which defaults to built-in stringification. Prepare your internal %set1 from @set1 and func(each element), and arrange the code (at least in the case where a func is passed -- it could have different implementations) to not need to call func again on some value but to always keep it with the key.
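A minimal sketch of that interface, under stated assumptions: the name intersection, its argument order, and the sample key function are all hypothetical and not taken from any existing module. The point is only that the key function runs once per element and the original item travels with its key.

```perl
use strict;
use warnings;

# Hypothetical helper, not the API of any existing module.
# $key_of is optional and defaults to Perl's built-in stringification.
sub intersection {
    my ($set1, $set2, $key_of) = @_;
    $key_of ||= sub { "$_[0]" };

    # key => original item; $key_of runs once per element of @$set1.
    my %set1;
    for my $item (@$set1) {
        my $key = $key_of->($item);
        $set1{$key} = $item unless exists $set1{$key};
    }

    my (%taken, @common);
    for my $item (@$set2) {
        my $key = $key_of->($item);        # once per element of @$set2
        next unless exists $set1{$key};    # not in the first list
        next if $taken{$key}++;            # already reported
        push @common, $set1{$key};
    }
    return @common;
}

my @set1 = ([1, 2, 3], [4, 5]);
my @set2 = ([qw/1 2 3/], [6]);

# Default keys are reference addresses, so nothing matches.
my @by_address = intersection(\@set1, \@set2);

# Assumed key function: flat arrays whose elements contain no commas.
my $calls = 0;
my @by_content = intersection(\@set1, \@set2, sub { $calls++; join ',', @{ $_[0] } });

printf "by address: %d, by content: %d, key calls: %d\n",
    scalar @by_address, scalar @by_content, $calls;
# prints "by address: 0, by content: 1, key calls: 4"
```

The result holds the items from the first list, so the caller gets real elements back and not the generated keys.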

        > No, I'm thinking that it is pointless to compare references since any two copies will test as unequal.

        You're thinking of nested structures; I'm thinking of objects. If you have instances representing something like "Employees", you don't wanna identify twins.

        > so do the same test as == against the reference itself (the address).

        Actually it's eq, think about the way scalars are compared.

        > The user wants to find the intersection of two lists, so he would be told to pass @set1 and @set2, and optionally a &func, which defaults to built-in stringification.

        I was already meditating on this. I also like the Python approach (where sets are a built-in datatype), which makes the hash function operate on the basis of an "equality" method of "hashable" objects (but I don't know how this is implemented efficiently). IIRC it's possible in Perl to overload the way objects are stringified.
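For what it's worth, stringification can be overloaded with the core overload pragma. A minimal sketch, assuming a made-up Employee class whose identity is its id field:

```perl
use strict;
use warnings;

package Employee;    # hypothetical class for illustration

# Overload stringification so that the object's identity is its id,
# not its memory address. fallback => 1 lets eq use the same string.
use overload '""' => sub { 'Employee#' . $_[0]{id} }, fallback => 1;

sub new {
    my ($class, %args) = @_;
    my $self = { %args };
    return bless $self, $class;
}

package main;

my $first  = Employee->new(id => 7, name => 'Ann');
my $second = Employee->new(id => 7, name => 'Ann');   # separate object, same id

my %set = ($first => 1);    # the key is the overloaded string

print "key: $_\n" for keys %set;                          # prints "key: Employee#7"
print "second is in the set\n" if exists $set{$second};   # printed
```

With this in place the hash-based set code needs no explicit key function, at the price of changing how the object prints everywhere else.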

        > If I want two ad-hoc uses of [qw/1 2 3/] to be considered the same, stringifying the reference won't do it.

        IMHO sets of "deeply compared" nested structures are better done with nested hashes (a kind of tree search for each level of nesting).

        > And we suppose that this (the key function) is expensive, so only call it once per value in each input list.

        agreed.

        BTW: interesting read

        Cheers Rolf

        update: fixed unescaped brackets