Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

sorting an array of hashes and removing duplicates

by jmfees (Initiate)
on Apr 01, 2010 at 23:36 UTC ( #832402=perlquestion: print w/ replies, xml ) Need Help??
jmfees has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I've been trying to solve this challenge for quite awhile, now I'm hoping a Perl expert out there has an elegant solution.

I build an array of hashes from a database table, calculating mileage distance as I insert each record into the structure. Then I sort the array of hashes to come up with the records that have the smallest distance from each client. each tutor has multiple records tied to them thru their ID, but I only want the closest record associated with each. The rest I want to remove from the structure.

In the example below, I only want one of tutor ID 24 and one of ID 50 to show up in the final result.

I'm assuming there may be some way to do this within the sort line I use below, but everything I've tried so far hasn't worked.

Here are my inputs, outputs and code snippets: (thanks in advance for anyone's help!)

Database records: ID, tutorID, Latitude, Longitude, City, State, Zipcode 1, 24, 39.948927, -75.1586630, Philadelphia, Pennsylvania, 19107 2, 24, 44.813947, -93.9194250, Young America, Minnesota, 55555 3, 24, 34.270548, -119.217135, Ventura, California, 93003 5, 6, 35.938033, -78.1193720, Spring Hope, North Carolina, 27882 6, 50, 32.357439, -111.044532, Tucson, Arizona, 85741 7, 5, 41.299209, -96.0521530, Omaha, Nebraska, 68134 8, 50, 38.921460, -119.117640, Yerington, Nevada, 89447 9, 7, 40.798205, -73.7382850, Great Neck, New York, 11027 ################## Perl code snippet ###################### $i=0; $sql = qq|select ID, tutorID, Latitude, Longitude, City, State, Zipcod +e from $table_tzips where Zipcode IS not NULL|; $sth = $dbh->prepare($sql); $sth->execute(); $tzips = $sth->fetchrow_hashref(); while (defined $tzips) { $trdist = &calculate_distance($clong, $clat, $tzips->{Longitude}, +$tzips->{Latitude}); ($tutorsdata[$i]{trID}, $tutorsdata[$i]{tutorsID}, $tutorsdata[$i]{tmiles}, $tutorsdata[$i]{city}, $tutorsdata[$i]{state}, $tutorsdata[$i]{zipcode}) = ($tzips->{ID}, $tzips->{tutorID}, $trdist, $tzips->{City}, $tzips->{State}, $tzips->{Zipcode}); $i++; $tzips=$sth->fetchrow_hashref(); } $sth->finish; @sortthetutors = sort{ $$a{tmiles} <=> $$b{tmiles} } @tutorsdata; print "Tutor Distances:\n\n"; for $printtutors(@sortthetutors) { print "$$printtutors{trID}, $$printtutors{tutorsID}, $$printtutors{tmiles}, $$printtutors{city}, $$printtutors{state}, $$printtutors{zipcode} \n"; } ################## END Perl code snippet ###################### Current output of this: 3, 24, 0, Ventura, California, 93003 8, 50, 321, Yerington, Nevada, 89447 6, 50, 489, Tucson, Arizona, 85741 7, 5, 1348, Omaha, Nebraska, 68134 2, 24, 1522, Young America, Minnesota, 55555 5, 6, 2307, Spring Hope, North Carolina, 27882 1, 24, 2432, Philadelphia, Pennsylvania, 19107 9, 7, 2502, Great Neck, New York, 11027 Desired output: 3, 24, 0, Ventura, California, 93003 8, 50, 321, Yerington, Nevada, 89447 7, 5, 1348, Omaha, Nebraska, 68134 5, 6, 2307, Spring Hope, North Carolina, 27882 9, 7, 2502, Great Neck, New York, 11027

Again, thanks!

Comment on sorting an array of hashes and removing duplicates
Download Code
Re: sorting an array of hashes and removing duplicates
by ikegami (Pope) on Apr 01, 2010 at 23:47 UTC
    use strict; use warnings; ... my $sql = qq| SELECT ID, tutorID, Latitude, Longitude, City, State, Zipcode FROM $table_tzips WHERE Zipcode IS not NULL |; my $sth = $dbh->prepare($sql); my %closest; $sth->execute(); while (my $tzips = $sth->fetchrow_hashref()) { my $tutorID = $tzips->{tutorID}; my $dist = calculate_distance( $clong, $clat, $tzips->{Longitude}, $tzips->{Latitude} ); next if $closest{$tutorID} && $closest{$tutorID}{dist} <= $dist; $closest{$tutorID} = { ID => $tutorID, tzipID => $tzips->{ID}, Dist => $dist, City => $tzips->{City}, State => $tzips->{State}, Zipcode => $tzips->{Zipcode}, }; } my @report_fields = qw( tzipID ID Dist City State Zipcode ); for my $tutor ( sort{ $a->{dist} <=> $b{dist} } values(%closest) ) { print(join(', ', @{$tutor}{@report_fields}), "\n"); }
      or ...
      my $sql = qq| SELECT ID, tutorID, Latitude, Longitude, City, State, Zipcode FROM $table_tzips WHERE Zipcode IS not NULL ORDER BY tutorID |; my $last_tutor_id = ""; my $last_dist; my @tutors; my $sth = $dbh->prepare($sql); $sth->execute(); while (my $tzips = $sth->fetchrow_hashref()) { my $tutor_id = $tzips->{tutorID}; my $dist = calculate_distance( $clong, $clat, $tzips->{Longitude}, $tzips->{Latitude} ); if ($last_tutor_id ne $tutor_id) { $last_tutor_id = $tutor_id; push @tutors, $tzips; } else { next if $last_dist <= $dist; $tutors[-1] = $tzips; } $tzips->{Dist} = $last_dist = $dist; } my @report_fields = qw( tzipID ID Dist City State Zipcode ); for my $tutor ( sort{ $a->{Dist} <=> $b{Dist} } @tutors) ) { print(join(', ', @{$tutor}{@report_fields}), "\n"); }
      In the OP case, where the full data has to be keep in memory for the final sorting, this approach may be just more complicated than ikegami's one and also imposes an extra load in the database.

      But when you can output the data as you go and don't need to keep it in memory for a final processing stage, it may perform better, specially for large data sets, as it's memory usage is fixed O(1), not dependant on the data set size O(N).

Re: sorting an array of hashes and removing duplicates
by Enlil (Parson) on Apr 02, 2010 at 00:04 UTC
    Another way to do it might be to just run the output of the sort through another filter like so:
    my %seen; my @sortthetutors = grep { !$seen{$_->{tutorsID}}++ } sort { $a->{tmiles} <=> $b->{tmiles} } @tutorsdata;
    -enlil

      This one worked beautifully, I tested it on 80,000 records, it took just a couple seconds...

      Thanks!!!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://832402]
Approved by ikegami
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others chanting in the Monastery: (6)
As of 2014-12-27 05:19 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (176 votes), past polls