more useful options | |
PerlMonks |
RFC: Set::Select: get intersection or union of sets; or more generally, the set of elements that are in one or more input setsby kikuchiyo (Hermit) |
on Dec 12, 2018 at 21:58 UTC ( [id://1227173]=perlmeditation: print w/replies, xml ) | Need Help?? |
If we have two sets, it is considered a solved problem to get their union or intersection or symmetric difference, there is even an entry in perlfaq4 about it. The situation is slightly more complicated if we have more than two input sets, because then it's a valid question to ask to e.g. get the set of elements that are in the first or second set but not in the third and fourth etc. The number of combinations grows rapidly with the number of input sets, and just writing ad hoc solutions to each little problem becomes infeasible. So a more general solution is needed - the hard part is designing the user interface so that it is able to express all the possible combinations of selections in a general and flexible, yet efficient and understandable manner. A cursory search of CPAN brought up several (abandoned?) modules in the Set::* namespace, but none of them was exactly what I needed. I have the outline of an attempted solution. It's an OO module that has a constructor to which the input sets can be fed, and one method called select, which accepts a selector string argument and emits (an arrayref of) the elements that match the selector. If we have 3 input sets, then the '110' selector string selects all elements that are in the first and second sets but not in the third.
A Venn diagram that may or may not make the intent clearer: .---. / 1 \ | | .--+--. .--+--. / | 3 X 5 | \ | | / \ | | | 2 \/ 7 \/ 4 | | |`---'| | | \ 6 / | \ \ / / `------^------' I think these selector strings as the primary (and only) user interface are better than the possible alternatives that come to mind: a verbose, ad hoc query language would have to be explained at length in the documention, tested carefully in the source, and parsed painfully at runtime, while a forest of arbitrarily named methods to select this or that subset would bloat the code needlessly and make it harder to use. Using regular expressions opens the door to abuse, but it also allows convenient and terse selector strings, makes the implementation efficient, and it's something people already know. If the elements are hashrefs (representing a record or object or something), there is a mode to use not the elements themselves but a named key inside them as the basis of selection, as the second example shows. This mode can be considered buggy as it is now, because only one version of a record with the same key is stored (in the example, some values are discarded. I don't have a good solution for this problem yet, partly because it would make the implementation slower and more complicated, partly because I don't know what would be the right thing to do. Questions:
Back to
Meditations
|
|