Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Clustering Numbers with Overlapping Members

by monkfan (Curate)
on Aug 07, 2006 at 11:03 UTC ( [id://565919]=perlquestion: print w/replies, xml ) Need Help??

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Brother Monks,

Given a list of ordered (sorted ascending) number : @nlist = (0,1,2,3,4,5,6,8,10);
List of candidate hash key: my @key_list = ('A'..'Z');
And a tolerance value: my $tolerance = 1;


Update: The array @nlist may come with duplicate elements.
For example @nlist = (0,0,1,2,3,3,4,5,6,8,8,10);

I would like to cluster the numbers in @nlist into a hash. Here each hash will contain elements within the tolerance (+/- 1) and allowing recurring (overlapping) elements. This is provided they are within the tolerance value. So finally I would have something like this:
my $VAR1 = { 'A' => [0,1,2], # centroid is 1, with other elem 0 and 2 ( +1 +/- 1) 'B' => [1,2,3], 'C' => [2,3,4], 'D' => [3,4,5], 'E' => [4,5,6], 'F' => [5,6], # centroid is 6 but only with one elem 5 (i +.e. 6-1) 'G' => [8], # next centroid is 8 but no members 'I' => [10], # same for centroid 10 }; # As the @nlist array grow larger the final hash will also grow larg +er.
So in principle every element in the array will be a centroid (except the first one, in this case it is 0).Then it finds its members within the prespecified tolerance value.

Update2:
I should add that when the very first element doesn't have its neighbour then it forms another cluster. In other words we ignore the first element only when it has neighbour within tolerance (see example below).

I am currently stuck with my code below. I really am not sure how to go about it.
use strict; use Data::Dumper; use Carp; # Size of this array could be much greater than this my @nlist = ( 0, 1, 2, 3, 4, 5, 6, 8, 10 ); # And the first element can be greater than 0 # This is a pre-generated key candidate. # It may not be used up all of them # In practice I will create a large key list, # that should be greater than potential hash to be created my @key_list = ( 'A' .. 'Z' ); my $tolerance = 1; my $hoa; foreach my $nlist (@nlist) { my @tmpar = ($nlist[0]); my $first_elem = $nlist[0]; my $klist; if ( check_member( \@tmpar, $first_elem, $nlist, $tolerance ) == 1 + ) { push @tmpar, $nlist; $klist = shift @key_list; push @{ $hoa->{$klist} }, @tmpar; } } print Dumper $hoa; # -- Subroutine ------ sub check_member { # To check if a value can be # a member of an array my ( $alist, $fel, $snum,$tol ) = @_; my $centroid = $alist->[0]; if ( $centroid - $tol == $snum or $centroid + $tol == $snum or $centroid == $snum and $centroid != $fel ) { return 1; } return 0; }

Regards,
Edward

Replies are listed 'Best First'.
Re: Clustering Numbers with Overlapping Members
by GrandFather (Saint) on Aug 07, 2006 at 11:53 UTC

    Assuming no duplicates the following may do what you want:

    use warnings; use strict; my @nlist = (0,1,2,3,4,5,6,8,10); my @key_list = ('A'..'Z'); my $tolerance = 1; my %hoa; while (@nlist) { my $key = shift @key_list; my $previous = shift @nlist; last if ! @nlist; my $centroid = $nlist[0]; @{$hoa{$key}} = ($previous) if $previous >= $centroid - $toleranc +e; for (@nlist) { last if $_ > $centroid + $tolerance; push @{$hoa{$key}}, $_; } } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa;

    Prints:

    A => [0 1 2] B => [1 2 3] C => [2 3 4] D => [3 4 5] E => [4 5 6] F => [5 6] G => [8] H => [10]

    DWIM is Perl's answer to Gödel

      Your method does not work for $tolerance > 1 - at least not how I understand the question :)

      My solution is based on the OP's statement

      So in principle every element in the array will be a centroid (except the first one, in this case it is 0).
      use warnings; use strict; my @nlist = (0,1,2,3,4,5,6,8,10); my @key_list = ('A'..'Z'); my $tolerance = 2; my %hoa; for my $i (1..$#nlist) { my $key = shift @key_list; $hoa{$key} = [grep in_range($nlist[$i], $_), @nlist ]; } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa; sub in_range { my ($centroid, $testnum) = @_; return abs($centroid - $testnum) <= $tolerance; }
      which outputs (with tolerance 2):
      A => [0 1 2 3] B => [0 1 2 3 4] C => [1 2 3 4 5] D => [2 3 4 5 6] E => [3 4 5 6] F => [4 5 6 8] G => [6 8 10] H => [8 10]

      -- Hofmator

Re: Clustering Numbers with Overlapping Members
by GrandFather (Saint) on Aug 07, 2006 at 11:21 UTC

    Are duplicates allowed in the number list and are numbers constrained to be integer?

    What happens to hash keys if there are more than 26 keys?


    DWIM is Perl's answer to Gödel
      Dear GP,

      Are duplicates allowed in the number list and are numbers constrained to be integer?
      Yes they can be duplicates. But the array is always sorted in ascending order. So for example given @nlist = (0,0,1,2,3,3,4,5,6,8,8,10);. We would like to have such cluster:
      # I have manually align this based on the centroid Centroid | my $VAR1 = { #V 'A' => [0,0,1], 'B' => [0,0,1,2], 'C' => [1,2,3,3], 'D' => [2,3,3,4], 'E' => [3,3,4,5], 'F' => [4,5,6], 'G' => [5,6] 'H' => [8,8], 'I' => [10], };
      I'm sorry for not being clear in the first place.
      What happens to hash keys if there are more than 26 keys?
      As stated in my snippet. I have made sure that the cluster won't need more than 26 keys.

      Regards,
      Edward
        OK, if you want this behaviour, then the following code - slightly modified from my previous posting - should work.
        use warnings; use strict; my @nlist = (0,0,1,2,3,3,4,5,6,8,8,10); my @key_list = ('A'..'Z'); my $tolerance = 1; my %hoa; my %uniq; @uniq{@nlist[1..$#nlist]} = (); for my $centroid (sort {$a <=> $b} keys %uniq) { my $key = shift @key_list; $hoa{$key} = [grep in_range($centroid, $_), @nlist ]; } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa; sub in_range { my ($centroid, $testnum) = @_; return abs($centroid - $testnum) <= $tolerance; }
        The idea is to iterate over the (unique) centroids - ignoring the very first element in @nlist - and extract all numbers 'in_range' from the original array.

        Update: Small bugfix.

        -- Hofmator

Re: Clustering Numbers with Overlapping Members
by Moron (Curate) on Aug 08, 2006 at 12:30 UTC
    This somewhat C-like solution is intended to avoid off-by-one errors at the beginning and end of the array. I have coded it as an array of array because that seemed logical and sufficient from the information so far -- the solution needs very little to make it a hash again. For example the operation chr(64+$_) would map the array indices in this AofA solution to the ascending alphabetical hash keys in the OP, should there be an undisclosed need for it to be a hash after all.
    my @aOfA=(); for ( my $i=0; $i<=$#nlist; $i++ ) { my @tmp = (); ( $i > 0 ) and (($nlist[$i] - $nlist[$i-1]) < 2 ) and $tmp[0]=$nlist[$i-1]; push @tmp, $nlist[$i]; ( $i < $#nlist ) and (($nlist[$i+1] - $nlist[$i]) < 2 ) and push @tmp, $nlist[$i+1]; push @aOfA, \@tmp; # Note that @tmp turns anonymous but doesn't dispose # at this reloop point }

    -M

    Free your mind

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://565919]
Approved by Corion
Front-paged by neversaint
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (6)
As of 2024-04-23 14:25 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found