Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options

Groups of Objects with Common Attributes

by Dev Null (Acolyte)
on May 14, 2018 at 18:50 UTC ( #1214490=perlquestion: print w/replies, xml ) Need Help??

Dev Null has asked for the wisdom of the Perl Monks concerning the following question:

Howdy Monks,

I have a set of objects, each of which has a collection of text attributes. I am trying to find groups of those objects with the most, and largest, sets of attributes in common. If I require all objects in a grouping to have all the attributes for it to count as a set, then clearly I'm just taking the intersection of the attribute sets, which is fairly simple. But I want to know about "partial sets" as well - I think an example will serve me best:

"apple" => ("red", "round", "plant", "fruit") "orange" => ("orange", "round", "plant", "fruit") "pumpkin" => ("orange", "round", "plant", "vegetable") "ball" => ("red", "round", "toy")

EDIT: Updated this, as my original example was unclear.

For a group size of 3, I check the sets of each permutation of 3 objects:

(apple, orange, pumpkin): orange: 2 round: 3 plant: 3 fruit: 2 (apple, orange, ball): red: 2 round: 3 plant: 2 fruit: 2 (apple, pumpkin, ball): red: 2 round: 3 plant:2 (orange, pumpkin, ball): orange: 2 round: 3 plant: 2

(Ignoring sets of size 1 as uninteresting...) Again, this is not that hard to calculate for a given group. But what I'm trying to do is _find_ the groups with the most/largest sets (I'm provisionally using a scoring system which I _think_ is mostly irrelevant to my problem...) Brute force checking the score of all the combinations of 4 from my collection of 500-odd objects leaves me checking over fifty billion combinations; in addition to offending my sensibilities, this is kinda slow. So I'm hoping your Monkishnesses can help come up with an algorithm / search strategy for constructing sets to score that are more likely to be "interesting". Maybe something like:

For each pairing of two objects, find the intersection of the two attribute sets.

_For each pair intersection, ordered by the size of the intersect descending and above some minimum size, calculate the intersect with each other object

__For each triplet intersection, ordered by the size of the intersect descending and above some minimum size, calculate the score for the set of objects

(Just thought of that then. I'll have to go try it, though it is pretty fuzzy still.) The idea being that each time you cut off a block of intersections for being below a minimum size, you chop out a huge chunk of the search space. But since I now am not doing a complete search of the combination space, don't I run the risk of coming at the same combination from different directions? - like measuring both of "apple", "pumpkin", "ball" and "ball", "apple", "pumpkin - and doing the work twice.

In any case, I appreciate any ideas, and apologize in advance if this is too much of an algorithm question that isn't related enough to perl specifically.

Replies are listed 'Best First'.
Re: Groups of Objects with Common Attributes
by alexander_lunev (Pilgrim) on May 14, 2018 at 20:10 UTC
    #!/usr/local/bin/perl -w use strict; my %things = ( "apple" => ["red", "round", "plant", "fruit"], "orange" => ["orange", "round", "plant", "fruit"], "pumpkin" => ["orange", "round", "plant", "vegetable"], "ball" => ["red", "round", "toy"], ); my %kinds; # # here we populate %kinds hash with kinds from %things, but of course # in real life we should do it while populating %things hash # foreach my $t (keys %things) { foreach my $k (@{$things{$t}}) { $kinds{$k}{$t} = 1; } } # search for "red round plant" my @search = split (' ',"red round plant"); my %kinds_slice; # # we can use slice here only because kinds are strings # if your kinds are not strings - well, it's another story # @kinds_slice{@search} = @kinds{@search}; my %things_sort; # # this is boring "count it" loop, but hey, # we're counting only %kinds_slice, not the whole/all %kinds! # foreach my $kind (keys %kinds_slice) { foreach my $thing (keys %{ $kinds_slice{$kind} }) { $things_sort{$thing}++; } } # # all things that match search, sorted # print "Matches:\n",join ("\n",map {"$_: $things_sort{$_}"} sort { $thi +ngs_sort{$b} <=> $things_sort{$a} } keys %things_sort),"\n\n"; # # or just get the first value to get best match # my @matches = sort { $things_sort{$b} <=> $things_sort{$a} } keys %thi +ngs_sort ; print "Best match: ".$matches[0]."\n Score: ".$things_sort{$matches[0] +}."\n";

      You can "deal" with this using hierarchical clustering, in the sense that it might narrow a large problem into a set of many smaller problems that you can brute force.

      First, for each text attribute you assign a dimension, so that your objects are n-vectors that look like, for example, [1,0,0,1,0], if you had 5 text attributes and this object that the first and fourth attribute.

      Then you plug these entries into a hierarchical clustering algorithm. You can cut the tree at any level and look at how objects have been grouped. This will let you identify objects that are close together in this space (i.e. share many attributes).

      What I've done below is an experiment in which I first calculate the sets of all items for which pairwise distances are the same and then hierarchically cluster each set.

      What you can then do is look into each set and find all combinations and find their shared attributes. For example, the largest intersection is apple,orange,pumpkin which is plant,round. But within this set, orange,pumpkin share orange,plant,round - this doesn't get picked up by the clustering ;/

      Here is some code that produces the following output, which I've trimmed for display

      ./pairs | grep cut | cut -d " " -f 4- | sort -u | sort -nr -k3 cluster,items,attr 0 3 2 apple,orange,pumpkin plant,round cluster,items,attr 0 3 1 ball,orange,pumpkin round cluster,items,attr 0 3 1 apple,ball,pumpkin round cluster,items,attr 1 2 2 apple,pumpkin plant,round cluster,items,attr 1 2 1 ball,pumpkin round cluster,items,attr 2 1 4 pumpkin orange,plant,round,vegetable cluster,items,attr 1 1 4 apple fruit,plant,red,round cluster,items,attr 1 1 3 ball red,round,toy cluster,items,attr 0 1 4 orange fruit,orange,plant,round cluster,items,attr 0 1 4 apple fruit,plant,red,round

      A bit messy:

      use List::Util (sum); use List::MoreUtils qw(uniq); use Algorithm::Cluster; my $data = { apple => [qw( red round plant fruit)], orange => [qw(orange round plant fruit)], pumpkin => [qw(orange round plant vegetable)], ball => [qw( red round toy)], }; # list of all attributes my @items = sort keys %$data; my @attr = sort(uniq( map { @{$data->{$_}} } keys %$data)); # data set recast as vectors my $datav; for my $item (@items) { my $vec; for my $attr (@attr) { push @$vec, scalar grep($_ eq $attr, @{$data->{$item}}); } push @$datav, $vec; } # keep track of all item sets for which items had the same distance my $scores; for my $i (0..@items-1) { for my $j ($i+1..@items-1) { my $score = pair_score($datav->[$i],$datav->[$j]); push @{$scores->{$score}}, ($i,$j); } } # now go through the sets of items with # same scores and hierarchically cluster them # based on a distance matrix generated by the score function for my $score (sort {$b <=> $a} keys %$scores) { my @item_idx = uniq(@{$scores->{$score}}); printf("score %d items %s\n",$score,join(",",@items[@item_idx])); cluster_and_report(@item_idx); } # distances matrix sub distance_matrix { my @item_idx = @_; my $distances; for my $i (0..@item_idx-1) { push @{$distances},[]; for my $j (0..$i-1) { push @{$distances->[-1]}, pair_score( $datav->[$item_idx[$i]],$d +atav->[$item_idx[$j]] ); } } return $distances; } # decide how to score a pair of items sub pair_score { my ($x,$y) = @_; my $score = 0; for my $i (0..@$x-1) { if($x->[$i] == $y->[$i]) { $score += $x->[$i]; # +1 score for a shared attribute } else { #$score--; # potential penalty for unshared attributes } } return $score; } sub cluster_and_report { my @item_idx = @_; my $zerov = [ map { 0 } @attr ]; my %param = ( data => distance_matrix(@item_idx), mask => [ map { $zerov } @item_idx ], weight => [ map { 1 } @attr ], transpose => 0, dist => "e", method => "s", ); my $tree = Algorithm::Cluster::treecluster(%param); for my $cut_level ( { my ($clusters) = $tree->cut($cut_level); #printdumper($clusters); my @cluster_ids = uniq(@$clusters); for my $cluster_id (@cluster_ids) { my @cluster_item_idx = map { $item_idx[$_] } grep($clusters->[$_ +] == $cluster_id, (0..@item_idx-1)); my @shared_vector = shared_vector( map { $datav->[$_] } @clus +ter_item_idx); my @shared_attr = map { $attr[$_] } grep($shared_vector[$_] +, (0..@shared_vector-1)); printinfo(sprintf("cut level %d cluster,items,attr %d %d %d %s % +s", $cut_level, $cluster_id, int(@cluster_item_idx), int(@shared_attr), join(",",@items[@cluster_item_idx]), join(",",@shared_attr))); } } } # use the shared attributes as a string to find sets of # items that share same attribute sub shared_vector { my @datav = @_; my $shared; for my $i (0..@{$datav[0]}-1) { if(grep($_->[$i] == 1, @datav) == @datav) { push @$shared, 1; } else { push @$shared, 0; } } return @$shared; }

      Thank you for your reply Alexander!

      Unfortunately, I don't think I was very clear in defining the problem, for which I apologize.

      You have counted up the number of occurrences of all the attributes, and returned the attribute with the highest number of hits.

      What I'm trying to do is feed it the entire list of objects (~500) and a size for a group (say, 4) and have it return for me the group or groups of 4 distinct objects that have the most attributes in common. So for my sample data from earlier, and a group size of 3, we'd look at all the permutations of 3 objects:

      (apple, orange, pumpkin): orange: 2 round: 3 plant: 3 fruit: 2 (apple, orange, ball): red: 2 round: 3 plant: 2 fruit: 2 (apple, pumpkin, ball): red: 2 round: 3 plant:2 (orange, pumpkin, ball): orange: 2 round: 3 plant: 2
      (Ignoring sets of size 1 as uninteresting...) And then, depending on how we choose to score them, say that (apple, orange, pumpkin) was the best set, and (apple, orange, ball) next best.
Re: Groups of Objects with Common Attributes
by hdb (Monsignor) on May 15, 2018 at 07:23 UTC

    Just some odd thoughts rather than a clear strategy. Say you have 75 attributes and 500 objects.

    1. Define a measure of distance between any pair of objects, e.g. 75 minus the number of common objects or 1 over the number of common objects.
    2. Calculate all distances between pairs. This gives you 250*499 distances.
    3. Apply some kind of clustering algorithm, e.g. nearest neighbors or something. These should be available on CPAN.

    Playing with distance measures and clustering algorithms (and their parameters) will lead you to a number of different solutions.

      Continuing on what hdb suggested,

      Each object lives in a space where the said Attributes are the coordinates.

      Coordinates can be discreet or binary or continuous, e.g. binary: Apple -> (orange=0, red=1, plant=1, fruit=1,toy=0). Continuous means that there is so much probability that apple is orange (0.5) or apple is a toy (0.01). etc. Notice that each object is characterised by a set of coordinates which includes ALL attributes. If attribute does not relate to object then it is set to zero.

      In this space, there is a distance metric, e.g. euclidean (others exist) to tell you how far apart are an Apple and a pumpkin.

      Clustering is a process which groups nearby objects (in this space) together based on the distance metric chosen.

      I see a problem with the above approach: most objects will have most of their coordinates set to zero, e.g. an Apple has only 3 attributes turned on and I guess the rest 72 will be off/zero. The problem is that clustering may group together objects because they have in common the absence of a lot of attributes, and you probably want to group objects together because they have in common the presence of an attribute. An obstacle but not a tough one.

Re: Groups of Objects with Common Attributes
by LanX (Sage) on May 15, 2018 at 11:30 UTC
    k-means clustering ? --> Algorithm::KMeans


    With aforementioned module you can even choose the range of permitted clusters

    # For very large data files, setting K to 0 will result in searching + through too # many values for K. For such cases, you can range limit the values + of K to search # through by my $clusterer = Algorithm::KMeans->new( datafile => $datafile, mask => "N111", Kmin => 3, Kmax => 10, cluster_seeding => 'random', + # or 'smart' terminal_output => 1, write_clusters_to_files => 1 +, );

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Wikisyntax for the Monastery

Re: Groups of Objects with Common Attributes
by martink (Initiate) on May 16, 2018 at 05:15 UTC

    I found a better way than the hierarchical clustering.

    The code below generates a line for each combination: (a) count of attributes, (b) count of items, (c) attribute list (d) item list.

    I pass it through sort -u because there may be duplication.

    1 4 round apple,ball,orange,pumpkin 2 2 red,round apple,ball 2 3 plant,round apple,orange,pumpkin 3 1 red,round,toy ball 3 2 fruit,plant,round apple,orange 3 2 orange,plant,round orange,pumpkin 4 1 fruit,orange,plant,round orange 4 1 fruit,plant,red,round apple 4 1 orange,plant,round,vetetable pumpkin

    The lines that trivially report a single item and all its attributes would come into play if you had multiple items with the same attribute set.

    The code is below. Basically what happens is that for the item2attr hash, I look at all unique attribute lists across items and then report the items that have this set in common. This comes into its own when you flip the hash and report the same thing, but flip the role of item and attribute. In the end you get a full list.

    I hope that you find this useful and that it does what I think it does :)

    my $item2attr = { apple => {red=>1, round=>1,plant=>1,fruit=>1}, orange => {orange=>1,round=>1,plant=>1,fruit=>1}, pumpkin => {orange=>1,round=>1,plant=>1, vetetable= +>1}, ball => {red=>1, round=>1, + toy=>1}, }; ## alternatively in the block below, generate a random data set with ## 500 items and 75 attributes with randomly 2-10 attributes per item =pod my $n_items = 500; my $n_attributes = 75; my $min_attr_in_item = 2; my $max_attr_in_item = 10; $item2attr = {}; for my $i (1..$n_items) { my $item = sprintf("it%03d",$i); my $n_attr = $min_attr_in_item + rand(1+$max_attr_in_item-$min_attr_ +in_item) ; my @attrs = sort ((map { sprintf("at%03d",$_) } (sort {rand() <=> r +and() } (1..75)))[0..$n_attr-1]); #printinfo($item,int(@attrs),@attrs); map {$item2attr->{$item}{$_} = 1} @attrs; } =cut # list of all items and attributes my @items = sort keys %$item2attr; my @attr = sort(uniq( map { keys %$_ } values %$item2attr)); # flip the hash my $attr2item; for my $attr (@attr) { map { $attr2item->{$attr}{$_} = $item2attr->{$_}{$attr} if $item2at +tr->{$_}{$attr} } @items; } report_sets($item2attr); report_sets($attr2item,-swap=>1); sub report_sets { my ($hash,%args) = @_; my $sets; for my $key (keys %$hash) { my $set_hash_str = join(",", sort keys %{$hash->{$key}}); $sets->{$set_hash_str}{$key}++; } for my $set_hash_str (keys %$sets) { my @attr = split(",",$set_hash_str); my @shared_attr = shared_items($hash,@attr); if($args{-swap}) { printinfo(int(@shared_attr),int(@attr),join(",",@shared_attr),jo +in(",",@attr)); } else { printinfo(int(@attr),int(@shared_attr),join(",",@attr),join(",", +@shared_attr)); } } } sub shared_items { my ($hash,@attr) = @_; my @shared_items; my @items = keys %$hash; for my $item (@items) { my $n = grep($hash->{$item}{$_}, @attr); push @shared_items, $item if $n == @attr; } return sort @shared_items; }

      After couple days of looking up what on earth is clustering, k-means, etc. -- hopeless -- I took closer look at your code. Wow, so simple. Thanks, martink! Here is re-factored version, if I may, discarding all that was perceived superfluous and simplifying (to fit my brain). So it's, effectively, just 2 plain loops: over all attributes and all items. Loop over items doesn't add to example with pumpkins, but is required for other test cases.

      I wonder, is it a mathematical fact, that even for 500 items and 75 attributes, there can be no more than 575 sets of common attributes? It somewhat contradicts to what I remember from combinatorics.

      use strict; use warnings; use feature 'say'; use List::Util qw/ uniq all /; use Data::Dump 'dd'; my $item2attr = { apple => { red => 1, round => 1, plant => 1, fruit => 1 } +, orange => { orange => 1, round => 1, plant => 1, fruit => 1 } +, pumpkin => { orange => 1, round => 1, plant => 1, vegetable => 1 } +, ball => { red => 1, round => 1, toy => 1 }, }; # list of all items and attributes my @items = sort keys %$item2attr; my @attr = sort( uniq( map { keys %$_ } values %$item2attr )); # flip the hash my $attr2item; for my $attr ( @attr ) { for ( @items ) { $attr2item-> { $attr }{ $_ } = 1 if $item2attr-> { $_ }{ $attr } } } #dd $item2attr; #say '-----------------------------------'; #dd $attr2item; #say '-----------------------------------'; my %solutions; # hash, to prevent duplicates for ( @attr ) { my @items_ = keys %{ $attr2item-> { $_ }}; my @attr_ = grep { my $attr = $_; all { $item2attr-> { $_ }{ $attr }} @items_ } @attr; _add_solution( \@attr_, \@items_ ) } for ( @items ) { my @attr_ = keys %{ $item2attr-> { $_ }}; my @items_ = grep { my $item = $_; all { $attr2item-> { $_ }{ $item }} @attr_ } @items; _add_solution( \@attr_, \@items_ ) } dd values %solutions; # then filter solutions for required number of common # attributes, or find max set of common attributes, # or find max set of items with any common attributes, etc. sub _add_solution { # writes to %solutions my ( $attr, $items ) = @_; return unless $#$items; # skip uninteresting @$_ = sort @$_ for @_; $solutions{ join ',', @$attr } = [ scalar @$attr, # count of attributes scalar @$items, # count of items $attr, # attribute list $items # item list ] } __END__ ( [2, 2, ["red", "round"], ["apple", "ball"]], [2, 3, ["plant", "round"], ["apple", "orange", "pumpkin"]], [1, 4, ["round"], ["apple", "ball", "orange", "pumpkin"]], [3, 2, ["fruit", "plant", "round"], ["apple", "orange"]], [3, 2, ["orange", "plant", "round"], ["orange", "pumpkin"]], )

      Edit: fixed issue with sorting.

Re: Groups of Objects with Common Attributes
by Anonymous Monk on May 14, 2018 at 19:51 UTC
    ok, 500 objects, but how many unique attributes?
      About 75 unique attributes. Most objects have between 2 and 7 attributes. So the "attribute denisty" - if that made-up term conveys what I intend - is fairly low.

        Most objects have between 2 and 7 attributes.
        With 7 attributes, there are but 2**7 == 128 subsets to consider; and no attribute group with more that 7 attributes.

        You could enumerate all subsets of each object. That might total to a few thousand unique attribute groups for the whole problem. In other words, with size and density so low, you can easily afford to construct the powerset of attributes. This way you can transform the 500 by 75 into 500 by few-thousands, search and sort by cardinality or weight, etc.

Re: Groups of Objects with Common Attributes
by Anonymous Monk on May 15, 2018 at 15:39 UTC

    Ah. Another Data mining question. Please note that your fundamental problem, set cover optimization, is known to be NP-hard.

    That said, take a look at Comparing two arrays and other posts by same OP.

    500 by 75 sounds too small to be a real-life problem. But if it is, could you please share with us the focus area of your research?

      Heh. "Research."

      I was playing a tablet game the other day (Gems of War, if anyone really cares), and the question of how to write a program to optimize team synergies intrigued me. It's not even an important part of the game - it was just the bit that interested me. I wrote the quick-and-dirty brute force solution but it's slow and aesthetically unpleasing, so I started poking around at better solutions - it seemed like the sort of thing that would be a known problem, and that you folks would know about. I only only went to the trouble of abstracting my problem description because I didn't want to have to explain the game mechanics, and because I thought it might make it easier to recognize the general class of problem.

      Thanks everyone for the feedback! As a hobby project, it may take me a day or two to have time to work on it again, but I'll definitely pop back in to let you know how it goes.

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://1214490]
Approved by marto
Front-paged by haukex
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others about the Monastery: (4)
As of 2022-12-01 03:00 GMT
Find Nodes?
    Voting Booth?