Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Re: Groups of Objects with Common Attributes

by alexander_lunev (Pilgrim)
on May 14, 2018 at 20:10 UTC ( #1214496=note: print w/replies, xml ) Need Help??

in reply to Groups of Objects with Common Attributes

#!/usr/local/bin/perl -w use strict; my %things = ( "apple" => ["red", "round", "plant", "fruit"], "orange" => ["orange", "round", "plant", "fruit"], "pumpkin" => ["orange", "round", "plant", "vegetable"], "ball" => ["red", "round", "toy"], ); my %kinds; # # here we populate %kinds hash with kinds from %things, but of course # in real life we should do it while populating %things hash # foreach my $t (keys %things) { foreach my $k (@{$things{$t}}) { $kinds{$k}{$t} = 1; } } # search for "red round plant" my @search = split (' ',"red round plant"); my %kinds_slice; # # we can use slice here only because kinds are strings # if your kinds are not strings - well, it's another story # @kinds_slice{@search} = @kinds{@search}; my %things_sort; # # this is boring "count it" loop, but hey, # we're counting only %kinds_slice, not the whole/all %kinds! # foreach my $kind (keys %kinds_slice) { foreach my $thing (keys %{ $kinds_slice{$kind} }) { $things_sort{$thing}++; } } # # all things that match search, sorted # print "Matches:\n",join ("\n",map {"$_: $things_sort{$_}"} sort { $thi +ngs_sort{$b} <=> $things_sort{$a} } keys %things_sort),"\n\n"; # # or just get the first value to get best match # my @matches = sort { $things_sort{$b} <=> $things_sort{$a} } keys %thi +ngs_sort ; print "Best match: ".$matches[0]."\n Score: ".$things_sort{$matches[0] +}."\n";

Replies are listed 'Best First'.
Re^2: Groups of Objects with Common Attributes
by martink (Initiate) on May 15, 2018 at 22:45 UTC

    You can "deal" with this using hierarchical clustering, in the sense that it might narrow a large problem into a set of many smaller problems that you can brute force.

    First, for each text attribute you assign a dimension, so that your objects are n-vectors that look like, for example, [1,0,0,1,0], if you had 5 text attributes and this object that the first and fourth attribute.

    Then you plug these entries into a hierarchical clustering algorithm. You can cut the tree at any level and look at how objects have been grouped. This will let you identify objects that are close together in this space (i.e. share many attributes).

    What I've done below is an experiment in which I first calculate the sets of all items for which pairwise distances are the same and then hierarchically cluster each set.

    What you can then do is look into each set and find all combinations and find their shared attributes. For example, the largest intersection is apple,orange,pumpkin which is plant,round. But within this set, orange,pumpkin share orange,plant,round - this doesn't get picked up by the clustering ;/

    Here is some code that produces the following output, which I've trimmed for display

    ./pairs | grep cut | cut -d " " -f 4- | sort -u | sort -nr -k3 cluster,items,attr 0 3 2 apple,orange,pumpkin plant,round cluster,items,attr 0 3 1 ball,orange,pumpkin round cluster,items,attr 0 3 1 apple,ball,pumpkin round cluster,items,attr 1 2 2 apple,pumpkin plant,round cluster,items,attr 1 2 1 ball,pumpkin round cluster,items,attr 2 1 4 pumpkin orange,plant,round,vegetable cluster,items,attr 1 1 4 apple fruit,plant,red,round cluster,items,attr 1 1 3 ball red,round,toy cluster,items,attr 0 1 4 orange fruit,orange,plant,round cluster,items,attr 0 1 4 apple fruit,plant,red,round

    A bit messy:

    use List::Util (sum); use List::MoreUtils qw(uniq); use Algorithm::Cluster; my $data = { apple => [qw( red round plant fruit)], orange => [qw(orange round plant fruit)], pumpkin => [qw(orange round plant vegetable)], ball => [qw( red round toy)], }; # list of all attributes my @items = sort keys %$data; my @attr = sort(uniq( map { @{$data->{$_}} } keys %$data)); # data set recast as vectors my $datav; for my $item (@items) { my $vec; for my $attr (@attr) { push @$vec, scalar grep($_ eq $attr, @{$data->{$item}}); } push @$datav, $vec; } # keep track of all item sets for which items had the same distance my $scores; for my $i (0..@items-1) { for my $j ($i+1..@items-1) { my $score = pair_score($datav->[$i],$datav->[$j]); push @{$scores->{$score}}, ($i,$j); } } # now go through the sets of items with # same scores and hierarchically cluster them # based on a distance matrix generated by the score function for my $score (sort {$b <=> $a} keys %$scores) { my @item_idx = uniq(@{$scores->{$score}}); printf("score %d items %s\n",$score,join(",",@items[@item_idx])); cluster_and_report(@item_idx); } # distances matrix sub distance_matrix { my @item_idx = @_; my $distances; for my $i (0..@item_idx-1) { push @{$distances},[]; for my $j (0..$i-1) { push @{$distances->[-1]}, pair_score( $datav->[$item_idx[$i]],$d +atav->[$item_idx[$j]] ); } } return $distances; } # decide how to score a pair of items sub pair_score { my ($x,$y) = @_; my $score = 0; for my $i (0..@$x-1) { if($x->[$i] == $y->[$i]) { $score += $x->[$i]; # +1 score for a shared attribute } else { #$score--; # potential penalty for unshared attributes } } return $score; } sub cluster_and_report { my @item_idx = @_; my $zerov = [ map { 0 } @attr ]; my %param = ( data => distance_matrix(@item_idx), mask => [ map { $zerov } @item_idx ], weight => [ map { 1 } @attr ], transpose => 0, dist => "e", method => "s", ); my $tree = Algorithm::Cluster::treecluster(%param); for my $cut_level ( { my ($clusters) = $tree->cut($cut_level); #printdumper($clusters); my @cluster_ids = uniq(@$clusters); for my $cluster_id (@cluster_ids) { my @cluster_item_idx = map { $item_idx[$_] } grep($clusters->[$_ +] == $cluster_id, (0..@item_idx-1)); my @shared_vector = shared_vector( map { $datav->[$_] } @clus +ter_item_idx); my @shared_attr = map { $attr[$_] } grep($shared_vector[$_] +, (0..@shared_vector-1)); printinfo(sprintf("cut level %d cluster,items,attr %d %d %d %s % +s", $cut_level, $cluster_id, int(@cluster_item_idx), int(@shared_attr), join(",",@items[@cluster_item_idx]), join(",",@shared_attr))); } } } # use the shared attributes as a string to find sets of # items that share same attribute sub shared_vector { my @datav = @_; my $shared; for my $i (0..@{$datav[0]}-1) { if(grep($_->[$i] == 1, @datav) == @datav) { push @$shared, 1; } else { push @$shared, 0; } } return @$shared; }
Re^2: Groups of Objects with Common Attributes
by Dev Null (Acolyte) on May 14, 2018 at 21:54 UTC

    Thank you for your reply Alexander!

    Unfortunately, I don't think I was very clear in defining the problem, for which I apologize.

    You have counted up the number of occurrences of all the attributes, and returned the attribute with the highest number of hits.

    What I'm trying to do is feed it the entire list of objects (~500) and a size for a group (say, 4) and have it return for me the group or groups of 4 distinct objects that have the most attributes in common. So for my sample data from earlier, and a group size of 3, we'd look at all the permutations of 3 objects:

    (apple, orange, pumpkin): orange: 2 round: 3 plant: 3 fruit: 2 (apple, orange, ball): red: 2 round: 3 plant: 2 fruit: 2 (apple, pumpkin, ball): red: 2 round: 3 plant:2 (orange, pumpkin, ball): orange: 2 round: 3 plant: 2
    (Ignoring sets of size 1 as uninteresting...) And then, depending on how we choose to score them, say that (apple, orange, pumpkin) was the best set, and (apple, orange, ball) next best.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1214496]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (3)
As of 2021-03-07 09:58 GMT
Find Nodes?
    Voting Booth?
    My favorite kind of desktop background is:

    Results (120 votes). Check out past polls.