Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?

by 7stud (Deacon)
on Feb 23, 2013 at 22:22 UTC ( #1020347=note: print w/ replies, xml ) Need Help??


in reply to Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?

I would think a binary search of the array would be a lot more efficient than a linear pass when n=500,000, which might take on average 250,000 comparisons to locate the range.

pickleswarlz, If the term 'binary search' scares you, there is an exercise in "Learning Perl" where you write a program that picks a number between 0-100 and then invites a user to guess the number. If the user starts off by picking 50, then the program should respond by saying 'higher' or 'lower'. What is the quickest way to guess the number that the program picked? Well, if the computer says higher, you should pick 75 next. Then if the computer says lower, you should pick 62 or 63 next. Do you see how halving the range each time will arrive at guessing the correct number the quickest? Try it with a friend. It will only take you 4-5 tries to guess the number if each time you guess the middle of the range that you know contains the number. That process is known as a 'binary search'.

The function below uses a binary search to return the (lowest) numbers in an array that bracket a specified number. Once you have all your ID's in a sorted array, you can search the array for the two numbers that bracket a given number:

sub binary_search { #Called like this: #binary_search(\@data, $#data, $num); my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my ($middle_index, $middle_value); while ($low_index <= $high_index) { $middle_index = int( ($high_index + $low_index)/2 ); $middle_value = $aref->[$middle_index]; print '.'; #print a dot for each attempt to locate target if ($middle_value < $target) { $low_index = $middle_index + 1; } elsif ($middle_value > $target) { $high_index = $middle_index - 1; } elsif ($middle_value == $target) { print "*"; #Indicates the target is actually in @data return $middle_index == 0 ? ($middle_value, $aref->[1]) : ($aref->[$middle_index-1], $middle_value) ; } } while ($aref->[$low_index] >= $target) { --$low_index; } return $aref->[$low_index], $aref->[$low_index+1]; } say ''; my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = binary_search(\@data, $#data, $num); say "$num: @results"; } --output:-- ....*1: 1 5 ....2: 1 5 ....3: 1 5 ....4: 1 5 ...*5: 1 5 ....*6: 5 6 ..*7: 6 7 ....*8: 7 8 ...*9: 8 9 ....*10: 9 10 .....11: 10 20 .....12: 10 20 .....13: 10 20 .....14: 10 20 .....15: 10 20 .....16: 10 20 .....17: 10 20 .....18: 10 20 .....19: 10 20 .....*20: 10 20 .....21: 20 30 .....22: 20 30 .....23: 20 30 .....24: 20 30 .....25: 20 30 .....26: 20 30 .....27: 20 30 .....28: 20 30 .....29: 20 30 .*30: 20 30 ....31: 30 33 ....32: 30 33 ....*33: 30 33 ...*34: 33 34 ....35: 34 38 ....36: 34 38 ....37: 34 38 ....*38: 34 38 ....39: 38 41 ....40: 38 41 ..*41: 38 41 ....42: 41 100 ....43: 41 100 ....44: 41 100 ....45: 41 100 ....46: 41 100 ....47: 41 100 ....48: 41 100 ....49: 41 100 ....50: 41 100 ....51: 41 100 ....52: 41 100 ....53: 41 100 ....54: 41 100 ....55: 41 100 ....56: 41 100 ....57: 41 100 ....58: 41 100 ....59: 41 100 ....60: 41 100 ....61: 41 100 ....62: 41 100 ....63: 41 100 ....64: 41 100 ....65: 41 100 ....66: 41 100 ....67: 41 100 ....68: 41 100 ....69: 41 100 ....70: 41 100 ....71: 41 100 ....72: 41 100 ....73: 41 100 ....74: 41 100 ....75: 41 100 ....76: 41 100 ....77: 41 100 ....78: 41 100 ....79: 41 100 ....80: 41 100 ....81: 41 100 ....82: 41 100 ....83: 41 100 ....84: 41 100 ....85: 41 100 ....86: 41 100 ....87: 41 100 ....88: 41 100 ....89: 41 100 ....90: 41 100 ....91: 41 100 ....92: 41 100 ....93: 41 100 ....94: 41 100 ....95: 41 100 ....96: 41 100 ....97: 41 100 ....98: 41 100 ....99: 41 100 ....*100: 41 100 ...*101: 100 101 ....*102: 101 102 .....103: 102 110 .....104: 102 110 .....105: 102 110 .....106: 102 110 .....107: 102 110 .....108: 102 110 .....109: 102 110 .....*110: 102 110

And after reading up on the cpan module List::BinarySearch, whose author appears to really know how to construct efficient search algorithms, I would consider using that module's bsearch_num_pos() function, which returns the optimal insertion index for a given number:

 # The following [function returns] an optimal insert point if no match.

    my @num_array =   ( 100, 200, 300, 400 );

    $index = bsearch_num_pos 500, @num_array; # Returns 4 - Best insert-at position.

Given the index for the insertion, you can get the range the number falls in with $data[$index-1]and $data[$index]. You just have to adjust for the case when the insertion point is 0, like I did in my code.

use List::BinarySearch qw(bsearch_num_pos); ... ... sub cpan_binary_search { my ($aref, $high_index, $target) = @_; my $low_index = 0; if ( $target < $aref->[$low_index] or $target > $aref->[$high_index] ) { die "Target not in range." } my $index = bsearch_num_pos($target, @$aref); return $index == 0 ? ($aref->[0], $aref->[1]) : ($aref->[$index-1], $aref->[$index]); } my @data = (1, 5, 6, 7, 8, 9, 10, 20, 30, 33, 34, 38, 41, 100, 101, 10 +2, 110); for my $num (1 .. $data[-1]) { my @results = cpan_binary_search(\@data, $#data, $num); say "$num: @results"; }

I'll have to run this Perl script on multiple files, so I'd love to be able to specify the input and output files in the command line as I go rather than specifying them in the script itself.

Any command line arguments can be found in the @ARGV array:

use strict; use warnings; use 5.012; say for @ARGV; say "\n"; my ($in, $out) = @ARGV; say $in; say $out; --output:-- $ perl prog1.pl infile_name outfile_name 10 20 infile_name outfile_name 10 20 infile_name outfile_name


Comment on Re: Reading values from one .csv, searching for closest values in second .csv, returning results in third .csv?
Select or Download Code

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1020347]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2014-08-01 10:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Who would be the most fun to work for?















    Results (4 votes), past polls