Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical

Data Matching Challenge

by expo (Initiate)
on Feb 01, 2007 at 20:15 UTC ( #597811=perlquestion: print w/replies, xml ) Need Help??
expo has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to compare two data sets and pull out the matches based on an id. I could use a little wisdom for this :)

Dataset A A "Monday" B "Tuesday" C "Wednesday" D "Thursday" Dataset B M 252 212 "Bill" A 325 908 "Jim" C 426 907 "Mike" A 423 383 "Sally" A 993 421 "Jim" C 737 432 "Mary"

The goal would be to merge these together such that redundant ids (first fields) are included and those not present in Dataset A are excluded. So the merging and filtering of the data would look something like this:

A "Monday" 325 908 "Jim" A "Monday" 423 383 "Sally" A "Monday" 993 421 "Jim" C "Wednesday" 737 432 "Mary" C "Wednesday" 737 432 "Mary"

Now, I could easily iterate through two arrays side by side and do a pattern match BUT the problem is speed. I have enormous amounts of data that I need to mine through so it needs to be pretty fast.

I started making a hash table but you need a unique id which is problematic the keys need to be unique and I am interested redundant matches. I started building a matrix using anonymous arrays but it started getting clumsy and I know there is a more elegant way to do this.

Any ideas or suggestions would be greatly appreciated!! Expo

Replies are listed 'Best First'.
Re: Data Matching Challenge
by GrandFather (Sage) on Feb 01, 2007 at 20:59 UTC

    Use a hash to check for hits from the smaller data set (A in the sample) and generate a composite list:

    use strict; use warnings; my @DataA = split /\n/, <<DATAA; A "Monday" B "Tuesday" C "Wednesday" D "Thursday" DATAA my @DataB = split /\n/, <<DATAB; M 252 212 "Bill" A 325 908 "Jim" C 426 907 "Mike" A 423 383 "Sally" A 993 421 "Jim" C 737 432 "Mary" DATAB my %AHits; my @merged; for (@DataA) { my ($key, $day) = split /\s+/; $AHits{$key} = $day; } for (@DataB) { my ($key, @params) = split /\s+/; next unless exists $AHits{$key}; push @merged, [$key, $AHits{$key}, @params]; } print join ("\t", @$_), "\n" for sort {$a->[0] cmp $b->[0]} @merged;


    A "Monday" 325 908 "Jim" A "Monday" 423 383 "Sally" A "Monday" 993 421 "Jim" C "Wednesday" 426 907 "Mike" C "Wednesday" 737 432 "Mary"

    DWIM is Perl's answer to Gödel
      This is very nice, fairly straightforward and clean. I will see if I can implement this approach. Thanks a bunch!! Expo
Re: Data Matching Challenge
by Tanktalus (Canon) on Feb 01, 2007 at 20:26 UTC

    This looks like something I'd pull out a bit of DBI and DBD::CSV (don't let the name fool ya - you should be able to use tabs or spaces or whatever as separators, not just commas), and a small bit of SQL for. I'm sure you could do faster by hand, but I'm not sure it would be as bug-free, nor would it be nearly so quick to develop.

Re: Data Matching Challenge
by jwkrahn (Monsignor) on Feb 02, 2007 at 02:10 UTC
    Perhaps this will be fast enough for you:
    #!/usr/bin/perl use warnings; use strict; use Fcntl ':seek'; open DATASET_A, '<', 'File_A' or die "Cannot open 'File_A' $!"; open DATASET_B, '<', 'File_B' or die "Cannot open 'File_B' $!"; my %dataB; my $tell = tell DATASET_B; while ( <DATASET_B> ) { if ( /^(\S+)/ ) { push @{ $dataB{ $1 } }, $tell + length $1; } $tell = tell DATASET_B; } while ( <DATASET_A> ) { chomp; my ( $key ) = /^(\S+)/; if ( exists $dataB{ $key } ) { for my $seek ( @{ $dataB{ $key } } ) { seek DATASET_B, $seek, SEEK_SET; print $_, scalar <DATASET_B>; } } } __END__
Re: Data Matching Challenge
by roboticus (Chancellor) on Feb 02, 2007 at 03:09 UTC

    Yet another way to do it:

    You might want to sort both files on the key of interest. Then you can scan through the files side-by-side and print the appropriate items--just like a merge sort.

    I recently responded to a similar question which you might want to look at if you're interested in this method--it has an example program in it that could easily be modified to handle your problem.


      Thanks Roboticus, I will take a look!
Re: Data Matching Challenge
by runrig (Abbot) on Feb 01, 2007 at 20:20 UTC

    Update: Hmm. Not reading carefully enough...join (Unix command line join, not perl function join) could be made to work, but you said you had arrays, though you did say you had enormous amounts of data...

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://597811]
Approved by Joost
and the fog begins to lift...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (6)
As of 2017-11-17 23:37 GMT
Find Nodes?
    Voting Booth?
    In order to be able to say "I know Perl", you must have:

    Results (276 votes). Check out past polls.