Re: Most efficient record selection method?

by rir (Vicar)
on Feb 18, 2009 at 20:55 UTC (#744882)


in reply to Most efficient record selection method?

Update: Major update. I am going to reexamine putting my $HOME under revision control.

Here is the algorithm.

  • For each value in a significant field, keep track of the number of occurrences in the input. Set the frequency to zero for a record's values when that record is moved to the output. So:
    $freq->{$fld_num}->{ $record[$fld_num] } = 0;
    is a central fragment. A score of zero means the data field value is no longer needed; a score of one means the value is needed and exactly one record supplies it; a score greater than one means the value is needed but which record will supply it is undetermined. (A short sketch of this structure follows the list.)
  • Repeat on the input:
    • Sort it, remove dupes.
    • Move records known to be needed.
    • Delete unneeded records.
    • A record is a duplicate of another if, for each significant field, the two values are equal or both scores are zero.
      Update: That's not so clear; an example (FieldValue:FieldScore):
      Your:0 Horse:4 Blue:0
      My:0 Horse:4 Black:0
      match because the ownerships and colors don't matter.
    • A record is needed if any significant field is unique (has a score of one).
    • A record is deletable if one significant field has a score greater than one and the rest are zero. This is just a directly identifiable duplicate.
    • That removes the easy stuff; whatever is already in the result must be counted toward any final solution of what remains in the input. The rest is brute force, the long walk.
    • It may be possible to split the input into smaller problems. This would involve grabbing a record and all its sisters and cousins: if we initially select Peace Happiness Joy, then we choose every record with Peace in field one, Happiness in field two, or Joy in field three; then we repeat for every chosen record. There's no need to guess at a good starting choice; either there will be partitions or not.
    • Examine the head of every permutation of the input set, looking for the ideal solution. The ideal solution's size would equal the largest count of differing values in any one field. If there is no ideal solution, increase the size of the head until a solution is found. When doing this, count the previously created output along with the head. In these situations, I tend to look for a solution and then for an improvement; that allows quitting with a sub-optimal solution.

      In the above, I write of an ideal solution size. Determining the best and worst possible sizes of our answer allows breaking off processing when further work is pointless.

    Possibly change the permutation machine so that each "permutation" array is treated as a wheel, reducing the number of arrays that need to be created. But I think the walking is the time sink, not creating the walkways.
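
    To make the scoring concrete, here is a minimal sketch of the frequency structure from the first bullet above. The two-record data set is invented for illustration; the variable names mirror the full program below.

#!/usr/bin/perl
# Sketch of the scoring structure only; the data set is invented.
use strict;
use warnings;

my $sig_flds = [ 0 .. 2 ];
my $input    = [ [qw(Your Horse Blue)], [qw(My Horse Black)] ];
my $freq     = {};

# count occurrences of each value per significant field
for my $fld (@$sig_flds) {
    ++$freq->{$fld}{ $_->[$fld] } for @$input;
}

print $freq->{1}{Horse}, $/;    # 2: needed, but the record is undetermined
print $freq->{2}{Blue},  $/;    # 1: unique, so its record must be moved

# moving a record to the output zeroes the scores of its values:
# $freq->{$fld}{ $record->[$fld] } = 0 for each significant field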

    Here's unguaranteed code.

#!/usr/bin/perl
use warnings;
use strict;
use FileHandle;
use Clone 'clone';
use Data::Compare;

$|++;

my @files    = qw( File1 File2 File3 );
my $sig_flds = [ 0 .. 2 ];    # config, the fields that we care about
@$sig_flds = sort @$sig_flds;

my $answer = [];
my $freq   = {};
my $input;                    # AoA ref

# put data into an array
for my $file (@files) {
    my $fh = FileHandle->new( $file, 'r' ) or die "Cannot open $file";
    scalar <$fh>;             # eat header line
    my $data = [<$fh>];
    chomp @$data;
    $_ = [ parse_csv( $_, ',' ) ] for @$data;
    close $fh or die "Cannot close $file";
    push @$input, @$data;
}

my $sanity_check = make_ck_list( $sig_flds, $input );

sub make_ck_list {
    my ( $sig_flds, $input ) = @_;
    my %answer_check;
    for my $rec (@$input) {
        for my $fld (@$sig_flds) {
            $answer_check{$fld}{ $rec->[$fld] } = 1;
        }
    }
    return \%answer_check;
}

print "Orig unscored input size(", scalar(@$input), "):$/";
display( $sig_flds, $input, $freq );

sample( $sig_flds, $input, $answer, $freq );

print "Solution size(", scalar(@$answer), "):$/";
display( $sig_flds, $answer, $freq );

my $answer_list = make_ck_list( $sig_flds, $answer );
print "Bad answer$/" unless Compare( $answer_list, $sanity_check );
exit;

# create a smallest possible sample with all values represented
sub sample {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    local $_;
    my $part_answer = [];

    $input = _reduce_input( $sig_flds, $input, $part_answer, $freq );

    # are we lucky/done ?
    if ( !defined $input ) {
        @$answer = @$part_answer;
        return;
    }

    # reassemble data w/o duplicates
    @$input = ( @$input, @$part_answer );

    # partition into separate problems
    my $partition = partition( $sig_flds, $input );

    # solve each partition, dumping all the solutions together
    my $ct;
    for my $part (@$partition) {

        # print "Starting partition ", ++$ct, $/;
        $freq        = {};
        $part_answer = [];

        # this does a little extra work this time
        $input = _reduce_input( $sig_flds, $part, $part_answer, $freq );
        solve_part( $sig_flds, $input, $part_answer, $freq )
            if defined $input;
        push @$answer, @$part_answer;
        $input = [];
    }
    return;
}

# solve_part -- solve a partition of data by going through
# permutations until a good solution is found
sub solve_part {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    $answer = [] if not defined $answer;
    my $part_answer = $answer;

    # factors to predict size of solution
    my ( $best_possible_size, $worst_possible_size ) =
        size_answer( $sig_flds, $input, $answer, $freq );

    # master needs list; cloned so deletes don't alter %$freq
    my %master_needs = %{ clone($freq) };

    # delete @$part_answer items from needs
    for my $rec (@$part_answer) {
        for my $fld (@$sig_flds) {
            if ( exists $master_needs{$fld}->{ $rec->[$fld] } ) {
                delete $master_needs{$fld}->{ $rec->[$fld] };
            }
        }
    }

    my $iter_perms    = permute(@$input);
    my $cur_best_size = $worst_possible_size;
    my $best_answer;
    my $total_perms = factorial( scalar @$input );
    my $ct;

    while ( my @perm = $iter_perms->() ) {
        ++$ct;
        my $curr_ans;
        my $needs = clone( \%master_needs );
        @$curr_ans = @$part_answer;

        while (@perm) {
            my $curr_rec = pop @perm;
            if ( useful( $sig_flds, $curr_rec, $needs ) ) {
                push @$curr_ans, $curr_rec;
                for my $fld (@$sig_flds) {
                    delete $needs->{$fld}{ $curr_rec->[$fld] };
                }
            }
        }
        if ( $ct % 1000 == 0 ) {    # XXX
            # print " Permutation: $ct/$total_perms "
            #     . "target size: $best_possible_size best $cur_best_size$/";
        }
        if ( empty( $sig_flds, $needs ) ) {
            if ( @$curr_ans < $cur_best_size ) {
                $best_answer   = $curr_ans;
                $cur_best_size = @$curr_ans;

                # print "$/Permutation $ct$/";
                # print "We have a new contender: ";
                # print "Target size: $best_possible_size ";
                # print "Contender's size: $cur_best_size$/";
                if ( $cur_best_size <= $best_possible_size ) {
                    @$answer = @$best_answer;
                    return;
                }

                # display( $sig_flds, $best_answer, $freq );
            }
        }
    }
    @$answer = @$best_answer if $best_answer;    # keep the best found, if any
    return;
}

# size_answer -- look at the data and determine:
#  1. the smallest possible size of an answer (or less if we must guess),
#  2. the largest possible size of an answer (or more if we must guess).
# This is an optimization: if smallest is accurate, we can break off
# when one of the best solutions is found; if largest is accurate we
# can stop processing a permutation when its solution is too large.
# I haven't considered how well this can be estimated.  Roughly: the
# smallest is the size of the largest set of data values in a
# significant field; the largest is bounded by the input size and by
# the sum of the fields' value-set sizes, less scalar @$sig_flds,
# plus one.
sub size_answer {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    my ( $best_possible_size, $worst_possible_size );

    # XXX stub
    return ( 1, 1000 );
}

# _reduce_input -- Remove duplicate records and cull items that are
# easy to determine are unneeded.  Put items that are definitely
# needed into @$part_answer.  Delete culls from @$input.  And build
# occurrence info in %$freq.
sub _reduce_input {
    my ( $sig_flds, $input, $part_answer, $freq ) = @_;
    local $_;

    # create frequency info
    for my $fld (@$sig_flds) {
        ++$freq->{$fld}->{ $_->[$fld] } for @$input;
    }

    my $start;
    do {
        return if not @$input;
        $start = @$input;    # progress flag

        # sort and remove dupes
        @$input = sort { cmp_on_flds( $sig_flds, $freq, $a, $b ) } @$input;
        $input = unique_on_flds( $sig_flds, $input, $freq );

        my $progress;
        do {
            $progress = 0;
            $_        = -1;
            while ( ref $input && $_ < @$input - 1 ) {
                $_++;
                if ( moveable( $_, $sig_flds, $input, $freq ) ) {
                    move( $_, $sig_flds, $input, $part_answer, $freq );
                    ++$progress;
                    next;
                }
                if ( removeable( $_, $sig_flds, $input, $freq ) ) {
                    remove( $_, $sig_flds, $input, $part_answer, $freq );
                    ++$progress;
                }
            }
        } while ($progress);
    } while ( ref $input && $start != @$input );
    return $input;
}

sub factorial {
    my $n   = int(shift);
    my $ret = 1;
    while ( $n > 1 ) {
        $ret *= $n;
        $n   -= 1;
    }
    return $ret;
}

# return true if rec will fill any possible need in answer
sub useful {
    my ( $sig_flds, $curr_rec, $needs ) = @_;
    for my $fld (@$sig_flds) {
        if ( exists $needs->{$fld}{ $curr_rec->[$fld] } ) {
            return 1;
        }
    }
    return;
}

# check a needs hash: true when every significant field's needs have
# been satisfied (every inner key deleted)
sub empty {
    my ( $sig_flds, $needs ) = @_;
    die "No tying" if defined tied $needs;
    for my $fld (@$sig_flds) {
        return 0 if %{ $needs->{$fld} || {} };
    }
    return 1;
}

# move item from input to output
sub move {
    my ( $rec, $sig_flds, $input, $answer, $freq ) = @_;
    local $, = ' ';

    # print "moving @{$input->[$rec]}$/";
    for my $fld (@$sig_flds) {
        $freq->{$fld}->{ $input->[$rec]->[$fld] } = 0;
    }
    push @$answer, $input->[$rec];
    splice @$input, $rec, 1;
    return;
}

# determine if item should go to output
sub moveable {
    my ( $rec, $sig_flds, $input, $freq ) = @_;

    # is moveable if any fld is unique
    for my $fld (@$sig_flds) {
        return 1 if 1 == $freq->{$fld}->{ $input->[$rec]->[$fld] };
    }
    return 0;
}

# remove unnecessary item from input
sub remove {
    my ( $rec, $sig_flds, $input, $answer, $freq ) = @_;
    local $, = ' ';

    # print "removing @{$input->[$rec]}$/";
    for my $fld (@$sig_flds) {
        if ( $freq->{$fld}->{ $input->[$rec]->[$fld] } != 0 ) {
            --$freq->{$fld}->{ $input->[$rec]->[$fld] };
        }
    }
    splice @$input, $rec, 1;
    return;
}

# determine if item is unneeded in input
sub removeable {
    my ( $rec, $sig_flds, $input, $freq ) = @_;
    my $flds_needed = 0;
    my $removeable  = 0;

    # removeable if no fields are needed, or if only one field is
    # needed and that one is not unique
    for my $fld (@$sig_flds) {
        ++$flds_needed if $freq->{$fld}->{ $input->[$rec]->[$fld] };
    }
    $removeable = 1 if 1 == $flds_needed || 0 == $flds_needed;
    return $removeable;
}

# cmp two \@array records on a \@list_of_field_indices; if both
# records' fields have a frequency of 0 their values don't matter
sub cmp_on_flds {
    my ( $sig_flds, $freq, $aa, $bb ) = @_;
    my $ret;
    for my $fld (@$sig_flds) {
        no warnings 'uninitialized';
        if (   0 == $freq->{$fld}->{ $aa->[$fld] }
            && 0 == $freq->{$fld}->{ $bb->[$fld] } )
        {
            next;
        }
        $ret = $aa->[$fld] cmp $bb->[$fld];
        return $ret if $ret;
    }
    return 0;
}

# delete items from a sorted \@array where the indexed fields are the
# same, and adjust frequency info
sub unique_on_flds {
    my ( $sig_flds, $data, $freq ) = @_;
    my ( $prev, $ret );
    while (@$data) {
        my $curr = shift @$data;
        my $not_dupe =
            !defined $prev || cmp_on_flds( $sig_flds, $freq, $curr, $prev );
        $prev = $curr;
        if ($not_dupe) {
            push @$ret, $curr;
        }
        else {
            # adjust frequencies
            for my $fld (@$sig_flds) {
                --$freq->{$fld}->{ $curr->[$fld] };
            }
        }
    }
    return $ret;
}

# display scored records
sub display {
    my ( $sig_flds, $aref, $freq ) = @_;
    local $_;
    local $" = ' ';
    for my $el (@$aref) {
        no warnings 'uninitialized';
        for (@$sig_flds) {
            printf "%4i %-6s ", $freq->{$_}->{ $el->[$_] }, $el->[$_];
        }
        print " @$el";
        print $/;
    }
    print $/;
    return;
}

# this is from perlmonks.org!Kraythorne, to parse input files
sub parse_csv {
    my $text  = shift;
    my $delim = shift;
    my $type  = undef;    # not used
    if ( $delim && $delim =~ /comma|tab|pipe|fixed/i ) {
        (    # $type,
            undef, $delim
        ) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms;
    }
    my @new = ();
    if ($delim) {
        while (
            $text =~ m{
                # the first part groups the phrase inside the quotes.
                # see explanation of this pattern in MRE
                "([^\"\\]*(?:\\.[^\"\\]*)*)"$delim?
                |  ([^$delim]+)$delim?
                |  $delim
            }gx
            )
        {
            if ( defined $+ ) { push @new, $+; }
            else              { push @new, ''; }
        }
        push @new, '' if substr( $text, -1, 1 ) eq $delim;
    }
    else {
        $new[0] = $text;
    }
    return @new;    # list of values that were delimiter-separated
}

# permute -- generate an iterator yielding all the permutations of a
# list; basically lifted from "Higher Order Perl", M. J. Dominus,
# pp. 134-135
sub permute {
    my @data = @_;
    my $perm = 0;
    return sub {
        do { $perm++; return @data } if !$perm;
        my $i;
        my $p = $perm;
        for ( $i = 1; $i <= @data && $p % $i == 0; $i++ ) {
            $p /= $i;
        }
        my $d = $p % $i;
        my $j = @data - $i;
        return if $j < 0;
        @data[ $j + 1 .. $#data ] = reverse @data[ $j + 1 .. $#data ];
        @data[ $j, $j + $d ] = @data[ $j + $d, $j ];
        $perm++;
        return @data;
    };
}

# partition -- split an array so that no partition shares a value,
# per significant field, with another partition
# example:
#   part 1: ( [a,a,a],[b,a,a],[b,c,d],[e,f,d] )
#   part 2: ( [f,g,g],[x,g,x],[y,z,x],[z,z,z] )
sub partition {
    my ( $sig_flds, $input ) = @_;
    my $ret;
    while (@$input) {
        my $part     = [];
        my $part_has = {};
        my $curr     = pop @$input;
        _add_rec2part( $sig_flds, $curr, $part, $part_has );

        # keep sweeping the remainder until a pass adds nothing, so
        # records related only transitively still join the partition
        my $added = 1;
        while ($added) {
            $added = 0;
            my $next_remainder = [];
            while (@$input) {
                $curr = pop @$input;
                if ( _part_needs_rec( $sig_flds, $curr, $part, $part_has ) ) {
                    _add_rec2part( $sig_flds, $curr, $part, $part_has );
                    $added = 1;
                }
                else {
                    push @$next_remainder, $curr;
                }
            }
            @$input = @$next_remainder;
        }
        push @$ret, $part;
    }
    return $ret;
}

# boolean, does the partition need this record
sub _part_needs_rec {
    my ( $sig_flds, $rec, $part, $freq ) = @_;
    for my $f (@$sig_flds) {
        if ( $freq->{$f}->{ $rec->[$f] } ) {
            return 1;
        }
    }
    return;
}

# add record to partition and adjust partition's needs hash
sub _add_rec2part {
    my ( $sig_flds, $rec, $part, $freq ) = @_;
    push @$part, $rec;
    for my $fld (@$sig_flds) {
        $freq->{$fld}->{ $rec->[$fld] } = 1;
    }
    return;
}
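
    The size_answer above is only a stub. Here is a sketch of the bounds its comment describes, under one reading of that comment: the lower bound is the largest set of still-needed values in any one field, and the upper bound follows from a greedy cover satisfying at least one new value per record after the first, capped by the input size. Treat it as a starting point, not a drop-in.

# size_answer -- sketch only; one reading of the comment above, not
# guaranteed to match the intended estimate
sub size_answer {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    my ( $best, $sum ) = ( 0, 0 );
    for my $fld (@$sig_flds) {

        # values in this field still carrying a non-zero score
        my $distinct = grep { $_ } values %{ $freq->{$fld} };
        $best = $distinct if $distinct > $best;
        $sum += $distinct;
    }

    # the first record can cover one value in every field; each later
    # record covers at least one new value
    my $worst = $sum - @$sig_flds + 1;
    $worst = @$input if @$input < $worst;
    return ( $best, $worst );
}
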
    Be well,
    rir


Re^2: Most efficient record selection method?
by Kraythorne (Sexton) on Feb 21, 2009 at 22:56 UTC
    Oops, I posted my code higher up (i.e. completed code that appears to work).

    Update: Copied the code below so it is more easily found.

    Just to confirm: it will accept any number of files and any selection of fields, and it calculates the minimum number of records required from across all the files to demonstrate the full range of field values for the fields selected. (I think!)

    If anyone can pick up on any bugs, please let me know :-)
      Update: I've copied this to where people can see it. Hopefully it will make sense to someone.

      Well, I think I have something that works. I don't really know how efficient it is, so any advice would be most welcome. Datatools::Parsecsv is a simple subroutine we use:
sub parse_csv {
    my $text  = shift;
    my $delim = shift;
    my $type  = undef;

    if ( $delim && $delim =~ /comma|tab|pipe|fixed/i ) {
        ( $type, $delim ) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms;
    }
    my @new = ();
    if ($delim) {
        while (
            $text =~ m{
                # the first part groups the phrase inside the quotes.
                # see explanation of this pattern in MRE
                "([^\"\\]*(?:\\.[^\"\\]*)*)"$delim?
                |  ([^$delim]+)$delim?
                |  $delim
            }gx
            )
        {
            if ( defined $+ ) { push @new, $+; }
            else              { push @new, ''; }
        }
        push @new, '' if substr( $text, -1, 1 ) eq $delim;
    }
    else {
        $new[0] = $text;
    }
    return @new;    # list of values that were comma-separated
}
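
      A quick sanity check of the parser (the sample line is made up):

my @fields = parse_csv( 'red,"large, fluffy",cotton', ',' );
print join( '|', @fields ), "\n";    # prints: red|large, fluffy|cotton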


      Anyway, below is the code, which I've tested on a few files; it appears to function. Please let me know if any of the code could be made more efficient. I have commented the code as much as possible.

#!c:/usr/bin/perl
use strict;
use warnings;
use Datatools::Parsecsv;

my @field_selection = qw(23 24 25 26);    # ARRAY INDEX LOCATIONS
my @file_list = ( 'File1', 'File2', 'File3' );

my %field_key        = ();
my %file_key         = ();
my $record_key_count = undef;

# set location of the sample record output file
my $proof_file = 'K:\Perl Development\Test Files\J02394_proofs.csv';

foreach my $file (@file_list) {
    my $header = 1;
    open( INFILE, "$file" )
        || die "Cannot open file $file for reading: $!\n";
    while ( my $record = <INFILE> ) {
        chomp $record;

        # ignore record header
        if ( $header == 1 ) {
            $header = 0;
            next;
        }

        # split fields into @fields
        my @fields = parse_csv( $record, ',' );
        my $record_key = undef;

        # build up concatenated key with pipe delimiter;
        # create field_key{field number}->{field key} = occurrences
        foreach my $field_number (@field_selection) {
            $record_key .= '|' . $fields[$field_number];
            ${ $field_key{$field_number} }{ $fields[$field_number] }++;
        }

        # create file_key{filename}->{concatenated record key} = full record
        ${ $file_key{$file} }{$record_key} = $record;
    }
    close INFILE;
}

open( PROOF_FILE, ">$proof_file" )
    || die "Cannot write to proof file $proof_file: $!\n";

my $built_key = undef;

OPTIMISE:

# generic hash for value-sorting hashes
my %hash_value_to_sort = ();

# keep track of smallest field tests
my %smallest_field = ();

# keep track of tried fields for multiple passes
my %tried_field = ();

# keep track of smallest fields that revert to global_match
my %exhausted = ();

my $match_result = 0;

# rebuild keys until match result == 1
while ( $match_result == 0 ) {
    $built_key    = &build_key();
    $match_result = &check_key_match($built_key);
}
goto OPTIMISE;

sub build_key {
    my $still_keys_left = undef;
    my $appended_keys   = undef;

    # cycle through field selection
    foreach my $field_number (@field_selection) {

        # when field keys are exhausted, keep global until a
        # successful match resets %exhausted
        if ( exists $exhausted{$field_number} ) {
            $appended_keys .= '\|' . 'global_match';
            next;
        }

        # get a key for each field and build up the final key
        my $found_key = &get_field_key($field_number);
        print "$field_number:" . ${$found_key}{'key'} . "\n";
        $appended_keys .= '\|' . ${$found_key}{'key'};

        # if the field key returns global match then all field keys
        # have been exhausted; no need to check for smallest for this
        # field any more, so clear %smallest_field if it relates to
        # this field
        if ( ${$found_key}{'key'} =~ m/ global_match /xms ) {
            $exhausted{$field_number} = '1';
            if ( exists $smallest_field{$field_number} ) {
                %smallest_field = ();
            }
        }

        # otherwise this field still has keys left
        else {
            $still_keys_left = 1;
        }

        # keep track of tried keys for this field in case we have
        # multiple passes
        ${ $tried_field{$field_number} }{ ${$found_key}{'key'} } = 1;

        # don't bother with defining smallest once the field is
        # exhausted; go to next field
        if ( exists $exhausted{$field_number} ) {
            next;
        }

        # 1st definition: record the number of occurrences the key
        # was found with, and flag the field as smallest
        if ( not defined $smallest_field{'number'} ) {
            $smallest_field{'number'}      = ${$found_key}{'number'};
            $smallest_field{$field_number} = '1';
        }

        # otherwise check the current number of occurrences for this
        # key and replace smallest if lower; flag the field as smallest
        elsif ( ${$found_key}{'number'} < $smallest_field{'number'} ) {
            $smallest_field{'number'}      = ${$found_key}{'number'};
            $smallest_field{$field_number} = '1';
        }
    }

    # if no keys are left to find, close the proof file and exit
    # the program
    if ( not defined $still_keys_left ) {
        close PROOF_FILE;
        exit;
    }

    # otherwise return the appended key
    return $appended_keys;
}

sub get_field_key {

    # field we want to get a key for
    my $field = shift;

    # generic hash for value sorting
    %hash_value_to_sort = %{ $field_key{$field} };

    # cycle keys, lowest number of occurrences first; this helps to
    # optimise the record selection
    foreach my $key ( sort HashValueSort keys %hash_value_to_sort ) {

        # if the field is flagged smallest occurrence, only select
        # the next key not already tried
        if ( exists $smallest_field{$field} ) {
            if ( exists ${ $tried_field{$field} }{$key} ) {
                next;
            }
        }

        # return key and number of occurrences
        return { 'key' => $key, 'number' => $hash_value_to_sort{$key} };
    }

    # if no valid key is available (i.e. all tried or all keys found)
    # return a global match for this field
    return { 'key' => 'global_match', 'number' => '0' };
}

sub check_key_match {
    my $check_key = shift;

    # substitute with global pattern match
    $check_key =~ s/\|global_match/\|\.\+/g;

    # cycle through each file until a record key is found in the file
    # hash: print the matching record from %file_key, delete matching
    # keys from %field_key, and delete the matching record from
    # %file_key
    foreach my $file (@file_list) {
        foreach my $key ( keys %{ $file_key{$file} } ) {
            if ( $key =~ m/\A $check_key \z/xms ) {
                my $not_dupe = &delete_keys_found($key);
                if ( $not_dupe == 1 ) {
                    print PROOF_FILE ${ $file_key{$file} }{$key} . "\n";
                    print STDOUT ${ $file_key{$file} }{$key} . "\n";
                    delete ${ $file_key{$file} }{$key};
                }

                # flag match found
                return 1;
            }
        }
    }

    # flag no match found
    return 0;
}

sub delete_keys_found {
    my $delete_keys = shift;
    my @delete_keys = split( '\|', $delete_keys );
    my $found_key   = 0;

    # make up any blank last field after split on '|'
    while ( $#delete_keys < $#field_selection ) {
        push @delete_keys, '';
    }

    # ignore the empty first index after split on '|'
    my $fld_num = 1;
    foreach my $field (@field_selection) {
        if ( exists ${ $field_key{$field} }{ $delete_keys[$fld_num] } ) {
            delete ${ $field_key{$field} }{ $delete_keys[$fld_num] };
            $found_key = 1;
        }
        $fld_num++;
    }

    # if no keys were found to delete then all were dupes, so flag
    # and do not print the record
    return $found_key;
}

sub HashValueSort {
    $hash_value_to_sort{$b} <=> $hash_value_to_sort{$a};
}
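
    To illustrate the key scheme in isolation: a record key is the selected field values joined with literal '|' separators, the built search key escapes them as '\|', and exhausted fields are swapped for '.+' before matching. A tiny stand-alone example (the field values are invented):

#!/usr/bin/perl
# stand-alone illustration of the key matching; values are invented
use strict;
use warnings;

my $record_key = '|Red|Large|Cotton';              # as stored in %file_key
my $check_key  = '\|Red\|global_match\|Cotton';    # as built by build_key

# as in check_key_match: exhausted fields match anything
$check_key =~ s/\|global_match/\|\.\+/g;

print "matched\n" if $record_key =~ m/\A $check_key \z/xms;    # matched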
