How to extract the best match from matrix

jnarayan81 has asked for the wisdom of the Perl Monks concerning the following question:

I have a matrix with human features (f1 .. f4)

Name    f1    f2    f3    f4
africa1    18    1    48    23
usa2    48    23    60    23
africa2    17    3    49    25
africa3    20    6    52    30
usa1    55    20    56    25
china1    35    37    55    87
china2    40    33    50    73
[download]

I wanted to query with selected feature (f1 ..f4), and print the best match.

use strict;
use warnings;

my @array = (37,35,59,70); #Are in the same order as in matrix f1 f2 f
+3 f4

for(my $i = 1; $i <= 4; $i++) {
    my %hash; my @allval;
    my $fh=read_fh($ARGV[0]); #Matrix file here
    while(<$fh>) {
        chomp; next if $. ==1; #Ignore header
        my @val = split('\t', $_);
        $hash{$val[0]}=$val[$i];
        push (@allval, $val[$i]);
    }
    close $fh;
#foreach (sort keys %hash) { print "$_ : $hash{$_}\n";}

my @allval_sorted = sort {$a <=> $b} @allval;
my $find =$array[$i-1];
#my $nearest = @{nearest(\@allval_sorted)}[0];
use List::Util 'max'; my $selected = max grep { $find >= $_ } @allval_
+sorted; my $nearest=$selected;

my @keys = grep { $hash{$_} == $nearest } keys %hash;
print "nearest to $find in array is: $nearest  and name is $keys[0]\n"
+;
undef @allval; undef %hash;

sub nearest {
  my ($a) = @_;  
  my $size = @$a;
  return $a if $size == 1;
  my $mid = int(($size-1) / 2);
  my $test = @$a[$mid];  
  return $test <= $find ?
    (abs($test-$find)<abs(@$a[$mid+1]-$find) ? [$test] : 
       $find <= @$a[$mid+1] ? [@$a[$mid+1]] : nearest([@$a[$mid+1 .. $
+size-1]]))
    :
    (abs($test-$find)<abs(@$a[$mid-1]-$find) ? [$test] : 
       $find >= @$a[$mid-1] ? [@$a[$mid-1]] : nearest([@$a[0 .. $mid]]
+));
}

}


#Open and Read a file
sub read_fh {
    my $filename = shift @_;
    my $filehandle;
    if ($filename =~ /gz$/) {
        open $filehandle, "gunzip -dc $filename |" or die $!;
    }
    else {
        open $filehandle, "<$filename" or die $!;
    }
    return $filehandle;
}
[download]

I wrote script and it print followings; It seems nearest function does not work !!!

WITH NEAREST sub
jitendra@jitendra-Aspire-S3-391[test] perl testMAT.pl mat.txt         
+[10:17PM]
nearest to 37 in array is: 35  and name is china1
nearest to 35 in array is: 37  and name is china1
nearest to 59 in array is: 48  and name is africa1
nearest to 70 in array is: 30  and name is africa3

WITH use List::Util 'max' 
jitendra@jitendra-Aspire-S3-391[test] perl testMAT.pl mat.txt         
+[10:19PM]
nearest to 37 in array is: 35  and name is china1
nearest to 35 in array is: 33  and name is china2
nearest to 59 in array is: 56  and name is usa1
nearest to 70 in array is: 30  and name is africa3
[download]

As you can see, I am opening and closing the matrix file four time, which would be really time consuming on long file.

Is there any other SMART way to achieve the best result? Thanks for you time.

Comment on How to extract the best match from matrix Select or Download Code

Replies are listed 'Best First'.
Re: How to extract the best match from matrix by kevbot (Vicar) on Mar 12, 2017 at 01:24 UTC
Hi jnarayan81, As pme and choroba mentioned, you will get more/better help if you describe the objective of your `nearest` function. I did not closely inspect your code, but I took a guess at what you were trying to do and came up with the following example. #!/usr/bin/env perl use strict; use warnings; use Data::Table; # Create a Data::Table with headers (assuming data is tab-delimited) my $dt = Data::Table::fromTSV( 'mat.txt', 1 ); # Get the number of rows in the Data::Table my $n_rows = $dt->nofRow; my $query = [ 37,35,59,70 ]; my $nearest_name = ''; my $min_dist; foreach my $i (0..$n_rows - 1){ my $row_ref = $dt->rowRef($i); # Get row of Data::Table as an ARRA +Y REF my $name = shift @{$row_ref}; # The name is in the first column my $dist = dist($query, $row_ref); $min_dist = !defined($min_dist) ? $dist : $dist < $min_dist ? $dist : $min_dist; $nearest_name = $dist <= $min_dist ? $name : $nearest_name; } print "The nearest to: "; print join(", ", @{$query}); print " is: $nearest_name\n"; exit; # Calculate the Euclidean distance between two vectors sub dist { my ($x, $y) = @_; unless(ref($x) eq 'ARRAY' and ref($y) eq 'ARRAY'){ die "Vectors must be given as array references" } unless (scalar @{$x} == scalar @{$y}) { die "Vectors are not of equal length"; } my $sum_sq = 0; my $len = scalar @{$x}; foreach my $i (0..$len - 1) { $sum_sq += ($x->[$i] - $y->[$i])*2; } return sqrt($sum_sq); } [download] I like the Data::Table module for manipulating tabular data, but there are many other ways to load/manipulate your data. If your definition of distance* or best match is not euclidean distance then modify the `dist` subroutine accordingly. UPDATE: Originally the `dist` sub returned `$sum_sq`. I changed it to return the correct euclidean distance which is `sqrt($sum_sq)`.	[reply] [d/l] [select]
Re^2: How to extract the best match from matrix by jnarayan81 (Sexton) on Mar 12, 2017 at 23:02 UTC
You are so to the point. Exactly, this is what I wanted to do.	[reply]
Re: How to extract the best match from matrix by choroba (Cardinal) on Mar 11, 2017 at 23:39 UTC
Can you describe what the function is supposed to do? Using strict is good, but declaring `sub nearest` inside of a `for` loop with the variable `$find` used in the subroutine while declared in the outer scope nearly spurns all its benefits. ($q=q:Sq=~/;[c](.)(.)/;chr(-\|\|-\|5+lengthSq)`"S\|oS2"`map{chr \|+ord }map{substrSq`S_+\|`\|}3E\|-\|`7**2-3:)=~y+S\|`+$1,++print+eval$q,q,a, [download]	[reply] [d/l] [select]
Re: How to extract the best match from matrix by Marshall (Canon) on Mar 12, 2017 at 23:23 UTC
Hi jnarayan81, Yes, in general re-reading an input file multiple times is a bad idea. I/O is very "expensive" both in terms of CPU time, but also in terms of clock time. A simple way to read the input file once, could be: `#!/usr/bin/perl use strict; use warnings; use Data::Dumper; my @matrix; # Array of Array, ["africa1",18,1,48,23] <DATA>; # throw away first line while (<DATA>) { next if /^\s*$/; # skip blank lines push (@matrix, [split(/\s+/,$_)]); # create AoA (2-D matrix) } print Dumper \@matrix; __DATA__ Name f1 f2 f3 f4 africa1 18 1 48 23 usa2 48 23 60 23 africa2 17 3 49 25 africa3 20 6 52 30 usa1 55 20 56 25 china1 35 37 55 87 china2 40 33 50 73 africa4 18 2 47 23` [download] The above may or may not be the best data structure for your search requirements. But I suggest reading the input file only once. It appears to me that you are comparing columns. I think that you want an output that is based "best match(es)" for each each column. I added "africa4" above. What is supposed to happen if: `my @array = (18,35,59,70);` Should this first array value of 18 result in two matches in the first data column?, africa1 and africa4?? It is completely plausible that 16 and 20 are both equally close to 18. What should the output be in those cases?	[reply] [d/l] [select]
Re: How to extract the best match from matrix by pme (Monsignor) on Mar 11, 2017 at 22:25 UTC
Hi jnarayan81, Would you tell us how to interpret 'best match' and 'nearest' in this context?	[reply]
Re^2: How to extract the best match from matrix by jnarayan81 (Sexton) on Mar 12, 2017 at 12:05 UTC
Nearest: Closest match to the other individual features. query:f1->56 check in f1 column and extract the closest match values and its name. BestMatch: Where does my query `my @array = (37,35,59,70); #Are in the same order as in matrix f1 f2 f +3 f4` [download] fall in matrix. Are features belongs to USA or Africa, China?	[reply] [d/l]


go ahead... be a heretic
	PerlMonks