Clustering Numbers with Overlapping Members

monkfan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Fellow Brother Monks,

Given a list of ordered (sorted ascending) number : @nlist = (0,1,2,3,4,5,6,8,10);
List of candidate hash key: my @key_list = ('A'..'Z');
And a tolerance value: my $tolerance = 1;

Update: The array @nlist may come with duplicate elements.
For example @nlist = (0,0,1,2,3,3,4,5,6,8,8,10);

I would like to cluster the numbers in @nlist into a hash. Here each hash will contain elements within the tolerance (+/- 1) and allowing recurring (overlapping) elements. This is provided they are within the tolerance value. So finally I would have something like this:


my $VAR1 = {
          'A' => [0,1,2],  # centroid is 1, with  other elem 0 and 2 (
+1 +/- 1)
          'B' => [1,2,3],
          'C' => [2,3,4],
          'D' => [3,4,5],
          'E' => [4,5,6],
          'F' => [5,6],    # centroid is 6 but only with one elem 5 (i
+.e. 6-1)
          'G' => [8],      # next centroid is 8 but no members
          'I' => [10],     # same for centroid 10
        };

  # As the @nlist array grow larger the final hash will also grow larg
+er.
[download]

So in principle every element in the array will be a centroid (except the first one, in this case it is 0).Then it finds its members within the prespecified tolerance value.

Update2:
I should add that when the very first element doesn't have its neighbour then it forms another cluster. In other words we ignore the first element only when it has neighbour within tolerance (see example below).

I am currently stuck with my code below. I really am not sure how to go about it.

use strict;              
use Data::Dumper;        
use Carp;                


# Size of this array  could be much greater than this 
my @nlist = ( 0, 1, 2, 3, 4, 5, 6, 8, 10 );

# And the first element can be greater than 0


# This is a pre-generated key candidate.
# It may not be used up all of them
# In practice I will create a large key list,
# that should be greater than  potential hash to be created
my @key_list = ( 'A' .. 'Z' );

my $tolerance = 1;

my $hoa;

foreach my $nlist (@nlist) {
    my @tmpar = ($nlist[0]);

    my $first_elem = $nlist[0];
    my $klist;

    if ( check_member( \@tmpar, $first_elem, $nlist, $tolerance ) == 1
+ ) {
        push @tmpar, $nlist;
        $klist = shift @key_list;
        push @{ $hoa->{$klist} }, @tmpar;
    }

}

print Dumper $hoa;

# -- Subroutine ------
sub check_member {

    # To check if a value can be
    # a member of an array
    my ( $alist, $fel, $snum,$tol ) = @_;

    my $centroid = $alist->[0];

    if (   $centroid - $tol == $snum
        or $centroid + $tol == $snum
        or $centroid == $snum
        and $centroid != $fel

        )
    {
        return 1;
    }

    return 0;
}
[download]

Regards,
Edward

Comment on Clustering Numbers with Overlapping Members Select or Download Code

Replies are listed 'Best First'.
Re: Clustering Numbers with Overlapping Members by GrandFather (Saint) on Aug 07, 2006 at 11:53 UTC
Assuming no duplicates the following may do what you want: `use warnings; use strict; my @nlist = (0,1,2,3,4,5,6,8,10); my @key_list = ('A'..'Z'); my $tolerance = 1; my %hoa; while (@nlist) { my $key = shift @key_list; my $previous = shift @nlist; last if ! @nlist; my $centroid = $nlist[0]; @{$hoa{$key}} = ($previous) if $previous >= $centroid - $toleranc +e; for (@nlist) { last if $_ > $centroid + $tolerance; push @{$hoa{$key}}, $_; } } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa;` [download] Prints: `A => [0 1 2] B => [1 2 3] C => [2 3 4] D => [3 4 5] E => [4 5 6] F => [5 6] G => [8] H => [10]` [download] DWIM is Perl's answer to Gödel	[reply] [d/l] [select]
Re^2: Clustering Numbers with Overlapping Members by Hofmator (Curate) on Aug 07, 2006 at 12:25 UTC
Your method does not work for `$tolerance > 1` - at least not how I understand the question :) My solution is based on the OP's statement So in principle every element in the array will be a centroid (except the first one, in this case it is 0). `use warnings; use strict; my @nlist = (0,1,2,3,4,5,6,8,10); my @key_list = ('A'..'Z'); my $tolerance = 2; my %hoa; for my $i (1..$#nlist) { my $key = shift @key_list; $hoa{$key} = [grep in_range($nlist[$i], $_), @nlist ]; } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa; sub in_range { my ($centroid, $testnum) = @_; return abs($centroid - $testnum) <= $tolerance; }` [download] which outputs (with tolerance 2): `A => [0 1 2 3] B => [0 1 2 3 4] C => [1 2 3 4 5] D => [2 3 4 5 6] E => [3 4 5 6] F => [4 5 6 8] G => [6 8 10] H => [8 10]` [download] -- Hofmator	[reply] [d/l] [select]
Re: Clustering Numbers with Overlapping Members by GrandFather (Saint) on Aug 07, 2006 at 11:21 UTC
Are duplicates allowed in the number list and are numbers constrained to be integer? What happens to hash keys if there are more than 26 keys? DWIM is Perl's answer to Gödel	[reply]
Re^2: Clustering Numbers with Overlapping Members by monkfan (Curate) on Aug 07, 2006 at 12:50 UTC
Dear GP, Are duplicates allowed in the number list and are numbers constrained to be integer? Yes they can be duplicates. But the array is always sorted in ascending order. So for example given `@nlist = (0,0,1,2,3,3,4,5,6,8,8,10);`. We would like to have such cluster: `# I have manually align this based on the centroid Centroid \| my $VAR1 = { #V 'A' => [0,0,1], 'B' => [0,0,1,2], 'C' => [1,2,3,3], 'D' => [2,3,3,4], 'E' => [3,3,4,5], 'F' => [4,5,6], 'G' => [5,6] 'H' => [8,8], 'I' => [10], };` [download] I'm sorry for not being clear in the first place. What happens to hash keys if there are more than 26 keys? As stated in my snippet. I have made sure that the cluster won't need more than 26 keys. Regards, Edward	[reply] [d/l] [select]
Re^3: Clustering Numbers with Overlapping Members by Hofmator (Curate) on Aug 07, 2006 at 14:01 UTC
OK, if you want this behaviour, then the following code - slightly modified from my previous posting - should work. `use warnings; use strict; my @nlist = (0,0,1,2,3,3,4,5,6,8,8,10); my @key_list = ('A'..'Z'); my $tolerance = 1; my %hoa; my %uniq; @uniq{@nlist[1..$#nlist]} = (); for my $centroid (sort {$a <=> $b} keys %uniq) { my $key = shift @key_list; $hoa{$key} = [grep in_range($centroid, $_), @nlist ]; } print "$_ => [@{$hoa{$_}}]\n" for sort keys %hoa; sub in_range { my ($centroid, $testnum) = @_; return abs($centroid - $testnum) <= $tolerance; }` [download] The idea is to iterate over the (unique) centroids - ignoring the very first element in `@nlist` - and extract all numbers 'in_range' from the original array. Update: Small bugfix. -- Hofmator	[reply] [d/l] [select]
Re^4: Clustering Numbers with Overlapping Members by monkfan (Curate) on Aug 07, 2006 at 14:35 UTC
Re: Clustering Numbers with Overlapping Members by Moron (Curate) on Aug 08, 2006 at 12:30 UTC
This somewhat C-like solution is intended to avoid off-by-one errors at the beginning and end of the array. I have coded it as an array of array because that seemed logical and sufficient from the information so far -- the solution needs very little to make it a hash again. For example the operation chr(64+$_) would map the array indices in this AofA solution to the ascending alphabetical hash keys in the OP, should there be an undisclosed need for it to be a hash after all. `my @aOfA=(); for ( my $i=0; $i<=$#nlist; $i++ ) { my @tmp = (); ( $i > 0 ) and (($nlist[$i] - $nlist[$i-1]) < 2 ) and $tmp[0]=$nlist[$i-1]; push @tmp, $nlist[$i]; ( $i < $#nlist ) and (($nlist[$i+1] - $nlist[$i]) < 2 ) and push @tmp, $nlist[$i+1]; push @aOfA, \@tmp; # Note that @tmp turns anonymous but doesn't dispose # at this reloop point }` [download] -M Free your mind	[reply] [d/l]


P is for Practical
	PerlMonks