Re: Most efficient record selection method?

in reply to Most efficient record selection method?

Update: Major update. I am going to reexamine putting my $HOME under revision control.

Here is the algorithm.

For each value in a field, keep track of number of occurences in the input. Set the frequency to zero for the values when a record is moved to the output. So:
```
  $freq->{$fld_no}->{ $record[$fld_num] } = 0;
[download]
```
is a central fragment. The score zero means the data field value is no longer needed; a score equal one means the value is needed in the results; and a score greater than one means the value is needed but which record is undetermined.

Repeat on the input:

Sort it, remove dupes.
Move records known to be needed.
Delete unneeded records.
A record is a duplicate if all fields are equal or the fields' scores are both zero.
Update: That's not so clear; an example (FieldValue:FieldScore):
Your:0 Horse:4 Blue:0 My:0 Horse:4 Black:0 match because the ownerships and colors don't matter.
A record is needed if any significant field is unique (has a score equal to one).
A record is deletable if one significant field has a score greater than one and the rest are zero. This is just a directly identifiable duplicate.
That removes the easy stuff, what is in the result needs to be added to any final solution of what remains in the input. The rest is brute force or the long walk.
It may be possible to split the input into smaller problems. This would involve grabbing a record and all its sisters and cousins. If we initially select: Peace Happiness Joy then we choose every record with Peace in field one, Happiness--two, Joy--three; then we repeat for every chosen record. There's no need to guess at a good starting choice, either there will be partitions or not.
Examine the head of every permutation of the input set, looking for the ideal solution. The ideal solution's size would be equal to the largest count of differing values for a field. If there is no ideal solution, increase the size of the head until a solution is found. In doing this count the previously created output with the head. In these situations, I tend to look for a solution then for an improvement; it allows quitting with a sub-optimal solution.
In the above, I write of an ideal solution size. Determining the best possible and worst possible size of our answer allows breaking off processing when it's pointless.

Possibly change the permutation machine so that each "permutation" array is treated as a wheel, reducing the number of arrays needing to be created. I think the walking is the time sink; not creating the walkways.

Here's unguaranteed code.

#!/usr/bin/perl
use warnings;
use strict;
use FileHandle;
use Clone 'clone';
use Data::Compare;
$|++;

my @files = qw( File1 File2 File3 );

my $sig_flds = [ 0 .. 2 ];  # config, the fields that we care about

@$sig_flds = sort @$sig_flds;

my $answer = [];
my $freq   = {};


my $input;                  # AoA ref

# put data into an array
for my $file (@files) {
    my $fh = FileHandle->new( $file, 'r' ) or die "Cannot open $file";
    scalar <$fh>;           # eat header line
    my $data = [<$fh>];
    chomp @$data;
    $_ = [ parse_csv( $_, ',' ) ] for @$data;
    close $fh or die "Cannot close $file";
    push @$input, @$data;
}

my $sanity_check = make_ck_list( $sig_flds, $input);

sub make_ck_list {
    my ( $sig_flds, $input) = @_;
    my %answer_check;

    for my $rec ( @$input ) {
        for my $fld ( @$sig_flds ) {
            $answer_check{ $fld }{ $rec->[$fld] } = 1;
        }
    }
    return \%answer_check;
}

print "Orig unscored input size(", scalar(@$input), "):$/";
display( $sig_flds, $input, $freq );

sample( $sig_flds, $input, $answer, $freq );

print "Solution size(", scalar(@$answer), "):$/";
display( $sig_flds, $answer, $freq );

my $answer_list = make_ck_list( $sig_flds, $answer);

print "Bad answer$/" unless Compare( $answer_list, $sanity_check );


exit;


# create a smallest possible sample with all values represented.
sub sample {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    local $_;
    my $part_answer = [];

    $input = _reduce_input( $sig_flds, $input, $part_answer, $freq );

    # are we lucky/done ? return
    do { @$answer = @$part_answer; return } if !defined $input;

    # reassemble data w/o duplicates
    @$input = ( @$input, @$part_answer );

    # partition into separate problems
    my $partition = partition( $sig_flds, $input );

    # solve each problem, dumping all the solutions together

    # process each partition
    my $ct;
    for my $part (@$partition) {
        # print "Starting partition", ++$ct, $/;
        $freq        = {};
        $part_answer = [];

        # this does a little extra work this time
        $input = _reduce_input( $sig_flds, $part, $part_answer, $freq 
+);

        solve_part( $sig_flds, $input, $part_answer, $freq )
            if defined $input;
        push @$answer, @$part_answer;
        $input = [];
    }
    return;
}

# solve_part -- solve a partition of data by going
#   through all permutations until a good solution
#   is found
sub solve_part {
    my ( $sig_flds, $input, $answer, $freq ) = @_;

    $answer = [] if not defined $answer;
    my $part_answer = $answer;

    # factor to predict size of solution
    my ( $best_possible_size, $worst_possible_size )
        = size_answer( $sig_flds, $input, $answer, $freq );

    # master needs list;
    my %master_needs = %$freq;

    # delete @$part_answer items from needs
    for my $rec (@$part_answer) {
        for my $fld (@$sig_flds) {
            if ( exists $master_needs{$fld}->{ $rec->[$fld] } ) {
                delete $master_needs{$fld}->{ $rec->[$fld] };
            }
        }
    }

    my $iter_perms = permute(@$input);

    my $cur_best_size = $worst_possible_size;
    my $best_answer;

    my $total_perms = factorial( scalar @$input );

    my $ct;
    while ( my @perm = $iter_perms->() ) {
        ++$ct;

        my $curr_ans;
        my $needs = clone( \%master_needs );

        @$curr_ans = @$part_answer;

        while (@perm) {


            my $curr_rec = pop @perm;

            if ( useful( $sig_flds, $curr_rec, $needs ) ) {
                push @$curr_ans, $curr_rec;
                for my $fld (@$sig_flds) {
                    delete $needs->{$fld}{ $curr_rec->[$fld] };
                }
            }
        }
        if ( $ct % 1000 == 0 ) {  # XXX
            #print
            #    " Permutation: $ct/$total_perms "
            #    . "target size: $best_possible_size best $cur_best_si
+ze$/";
        }
        if ( empty( $sig_flds, $needs ) ) {
            if ( @$curr_ans < $cur_best_size ) {
                $best_answer   = $curr_ans;
                $cur_best_size = @$curr_ans;
                #print "$/Permutation $ct$/";
                #print "We have a new contender: ";
                #print "Target size: $best_possible_size  ";
                #print "Contender's size: $cur_best_size$/";

                if ( $cur_best_size <= $best_possible_size ) {
                    @$answer = @$best_answer;
                    return;
                }
                #display( $sig_flds, $best_answer, $freq );
            }
        }
    }
    @$answer = @$best_answer;
    return;
}


# size_answer -- look at the data and determine:
#  1. the smallest possible size of an answer (or less if we must gues
+s),
#  2. the largest possible size of an answer (or more if we must guess
+),
#  This is an optimization: if smallest is accurate, we can break off
#  when one of the best solutions is found; if largest is accurate we
#  can stop processing a permutation when its solution is too large.
#  I haven't considered how well this can be estimated.
#  , the smallest is the size of the largest set of
#  data values in a significant field; and the largest is the greater 
+of
# the input size or ( sum of significant field sets plus scalar @$sig_
+flds
# minus one.
#
sub size_answer {
    my ( $sig_flds, $input, $answer, $freq ) = @_;
    my ( $best_possible_size, $worst_possible_size );

    # XXX stub
    return ( 1, 1000 );

}

# _reduce_input -- Remove duplicate records and cull
#   items that are easy to determine are unneeded.
#   Put items that are definitely needed into @$part_answer.
#   Delete culls from @$input.  And build occurrence %$frequency.
sub _reduce_input {
    my ( $sig_flds, $input, $part_answer, $freq ) = @_;
    local $_;

    # create frequency info
    for my $fld (@$sig_flds) {
        ++$freq->{$fld}->{ $_->[$fld] } for @$input;
    }

    my $start;
    do {
        return if not defined @$input;
        $start = @$input;    # progress flag

        # sort and remove dupes
        @$input = sort { cmp_on_flds( $sig_flds, $freq, $a, $b ) } @$i
+nput;
        $input = unique_on_flds( $sig_flds, $input, $freq );

        my $progress;
        do {
            $progress = 0;
            $_        = -1;
            while ( ref $input && $_ < @$input - 1 ) {
                $_++;
                if ( moveable( $_, $sig_flds, $input, $freq ) ) {
                    move( $_, $sig_flds, $input, $part_answer, $freq )
+;
                    ++$progress;
                    next;
                }
                if ( removeable( $_, $sig_flds, $input, $freq ) ) {
                    remove( $_, $sig_flds, $input, $part_answer, $freq
+ );
                    ++$progress;
                }
            }

        } while ($progress);

    } while ( ref $input && $start != @$input );

    return $input;
}

sub factorial {
    my $n   = int(shift);
    my $ret = $n;
    while ( $n > 1 ) {
        $ret *= ( $n - 1 );
        $n -= 1;
    }
    return $ret;
}

# return true if rec will fill any possible need in answer
sub useful {
    my ( $sig_flds, $curr_rec, $needs ) = @_;
    for my $fld (@$sig_flds) {
        if ( exists $needs->{$fld}{ $curr_rec->[$fld] } ) {
            return 1;
        }
    }
    return;
}

# check a needs/freq hash for emptiness
sub empty {
    my ( $sig_flds, $needs ) = @_;
    local $_;
    die "No tying" if defined tied $needs;
    for (@$sig_flds) {
        return 1 if "0" ne $needs->{$_};
    }
    return 0;
}

# move item from input to output
sub move {
    my ( $rec, $sig_flds, $input, $answer, $freq ) = @_;

    local $, = ' ';

    # print "moving  @{$input->[$rec]}$/";

    for my $fld (@$sig_flds) {
        $freq->{$fld}->{ $input->[$rec]->[$fld] } = 0;
    }
    push @$answer, $input->[$rec];
    splice @$input, $rec, 1;
    return;
}
# determine if item should go to output
sub moveable {
    my ( $rec, $sig_flds, $input, $freq ) = @_;

    # is moveable if any fld is unique
    for my $fld (@$sig_flds) {
        return 1 if 1 == $freq->{$fld}->{ $input->[$rec]->[$fld] };
    }
    return 0;
}

# remove unnecessary item from input
sub remove {
    my ( $rec, $sig_flds, $input, $answer, $freq ) = @_;

    local $, = ' ';

    # print "removing  @{$input->[$rec]}$/";

    for my $fld (@$sig_flds) {
        if ( $freq->{$fld}->{ $input->[$rec]->[$fld] } != 0 ) {
            --$freq->{$fld}->{ $input->[$rec]->[$fld] };
        }
    }

    splice @$input, $rec, 1;
    return;
}

# determine if item is unneeded in input
sub removeable {
    my ( $rec, $sig_flds, $input, $freq ) = @_;
    my $flds_needed = 0;
    my $removeable  = 0;
    my $score       = 0;

    # removeable if only 1 field needed and that's not unique or if
    # no fields needed
    for my $fld (@$sig_flds) {
        ++$flds_needed if $freq->{$fld}->{ $input->[$rec]->[$fld] };
        $score += $freq->{$fld}->{ $input->[$rec]->[$fld] };
    }
    $removeable = 1 if 1 == $flds_needed || 0 == $flds_needed;

    return $removeable;
}

# cmp using a \@list_of_field_indices cmp two \@arrays,
# if both records' fields have a frequency of 0 their values don't mat
+ter
sub cmp_on_flds {
    my ( $sig_flds, $freq, $aa, $bb ) = @_;

    my $ret;
    for my $fld (@$sig_flds) {
        no warnings 'uninitialized';
        if (   0 == $freq->{$fld}->{ $aa->[$fld] }
            && 0 == $freq->{$fld}->{ $bb->[$fld] } )
        {
            next;
        }
        $ret = $aa->[$fld] cmp $bb->[$fld];
        return $ret if $ret;
    }
    return 0;
}

# delete items from a sorted \@array where \@fields_indexed are same
# and adjust frequency info
sub unique_on_flds {
    my ( $sig_flds, $data, $freq ) = @_;
    my ( $prev,     $ret );

    while (@$data) {
        my $not_dupe = @$sig_flds;
        my $curr     = shift @$data;
        $not_dupe = cmp_on_flds( $sig_flds, $freq, $curr, $prev );

        $prev = $curr;
        if ($not_dupe) {
            push @$ret, $curr;
        }
        else {

            # adjust frequencies
            for my $fld (@$sig_flds) {
                --$freq->{$fld}->{ $curr->[$fld] };
            }
        }
    }
    return $ret;
}

# display scored records
sub display {
    my ( $sig_flds, $aref, $freq ) = @_;
    local $_;
    local $" = ' ';
    for my $el (@$aref) {
        no warnings 'uninitialized';
        for (@$sig_flds) {
            printf "%4i %-6s ", $freq->{$_}->{ $el->[$_] }, $el->[$_];
        }
        print " @$el";
        print $/;
    }
    print $/;
    return;
}

# this is from perlmonks.org!Kraythorne to parse input files
sub parse_csv {

    my $text  = shift;
    my $delim = shift;

    # not used    my $type  = undef;
    if ( $delim && $delim =~ /comma|tab|pipe|fixed/i ) {
        (

            # $type,
            undef, $delim ) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms;
    }

    my @new = ();
    if ($delim) {
        while (
            $text =~ m{
        # the first part groups the phrase inside the quotes.
        # see explanation of this pattern in MRE
        "([^\"\\]*(?:\\.[^\"\\]*)*)"$delim?
           |  ([^$delim]+)$delim?
           | $delim
        }gx
            )
        {
            if ( defined $+ ) {
                push( @new, $+ );
            }
            else {
                push( @new, '' );
            }
        }
        push( @new, '' ) if substr( $text, -1, 1 ) eq ($delim);
    }
    else {
        $new[0] = $text;
    }
    return @new;    # list of values that were comma-separated
}

# permute -- generate an iter yielding all the permutations
# of a list, basically lifted from "Higher Order Perl",
# M.J. Dominus, pp 134-135

sub permute {
    my @data = @_;
    my $perm = 0;
    return sub {
        do { $perm++; return @data } if !$perm;
        my $i;
        my $p = $perm;
        for ( $i = 1; $i <= @data && $p % $i == 0; $i++ ) {
            $p /= $i;
        }
        my $d = $p % $i;
        my $j = @data - $i;
        return if $j < 0;

        @data[ $j + 1 .. $#data ] = reverse @data[ $j + 1 .. $#data ];
        @data[ $j, $j + $d ] = @data[ $j + $d, $j ];

        $perm++;
        return @data;
    };
}


# partition -- split an array so that each partition shares
#      no values per significant field w/ another partition
#      example:
#         part 1: ( [a,a,a],[b,a,a],[b,c,d],[e,f,d] )
#         part 2: ( [f,g,g],[x,g,x],[y,z,x],[z,z,z] )
#
sub partition {
    my ( $sig_flds, $input ) = @_;
    my $ret;

    while (@$input) {
        my $part     = [];
        my $part_has = {};
        my $curr     = pop @$input;

        _add_rec2part( $sig_flds, $curr, $part, $part_has );

        my $remainder      = $input;
        my $next_remainder = [];

        while (@$remainder) {
            $curr = pop @$remainder;

            if ( _part_needs_rec( $sig_flds, $curr, $part, $part_has )
+ ) {

                _add_rec2part( $sig_flds, $curr, $part, $part_has );
            }
            else {
                push @$next_remainder, $curr;
            }
        }
        push @$ret, $part;
        @$remainder = @$next_remainder;
    }
    return $ret;
}

# boolean, does the partition need this record
sub _part_needs_rec {
    my ( $sig_flds, $rec, $part, $freq ) = @_;

    for my $f (@$sig_flds) {
        if ( $freq->{$f}->{ $rec->[$f] } ) {
            return 1;
        }
    }
    return;
}

# add record to partition and adjust partition's needs hash
sub _add_rec2part {
    my ( $sig_flds, $rec, $part, $freq ) = @_;

    push @$part, $rec;

    for my $fld (@$sig_flds) {
        $freq->{$fld}->{ $rec->[$fld] } = 1;
    }
    return;
}
[download]

Be well,
rir

Comment on Re: Most efficient record selection method? Select or Download Code

Replies are listed 'Best First'.
Re^2: Most efficient record selection method? by Kraythorne (Sexton) on Feb 21, 2009 at 22:56 UTC
Ooops posted my code higher up.(i.e. completed code that appears to work) *Update:* Copied the code below so more easily found Just to confirm - it will accept any number of files and any selection of fields and calculate the minimum amount of records required from across all the files to demonstrate the full range of field values for the fields selected. (I think!). If anyone can pick up on any bugs please let me know :-)	[reply]
Re^3: Most efficient record selection method? by Kraythorne (Sexton) on Feb 24, 2009 at 15:17 UTC
*Update:* I've copied this to where people can see it. Hopefully it will make sense to someone Well I think I have something that works - don't really know how efficient it is so any advice would be most welcome: The Datatools::Parsecsv is a simple subroutine we use: sub parse_csv { my $text = shift; my $delim = shift; my $type = undef; if ($delim && $delim =~ /comma\|tab\|pipe\|fixed/i){ ($type, $delim) = $delim =~ m/\A (.+) \s - \s (.+) \z/xms; } my @new = (); if ($delim){ while ($text =~ m{ # the first part groups the phrase inside the quotes. # see explanation of this pattern in MRE "([^\"\\](?:\\.[^\"\\])*)"$delim? \| ([^$delim]+)$delim? \| $delim }gx){ if (defined $+){ push(@new, $+) } else{ push(@new, '') } } push(@new, '') if substr($text, -1,1) eq ($delim); } else{ $new[0] = $text; } return @new; # list of values that were comma-separated } [download] Anyway below is the code which I've tested on a few files and it appears to function. Please let me know if any of the code could be made more efficient - I have commented the code as much as possible. #!c:/usr/bin/perl use strict; use warnings; use Datatools::Parsecsv; my @field_selection = qw(23 24 25 26); #ARRAY INDEX LOCATIONS my @file_list = (File1, File2, File3 ); my %field_key = (); my %file_key = (); my $record_key_count = undef; #set location of the sample record output file my $proof_file = 'K:\Perl Development\Test Files\J02394_proofs.csv'; foreach my $file (@file_list){ my $header = 1; open (INFILE, "$file") \|\| die "Cannot open file $file for reading: $!\n"; while (my $record = <INFILE>){ chomp $record; #ignore record header if ($header == 1){ $header = 0; next; } #split fields into @fields my @fields = parse_csv($record, ','); my $record_key = undef; #build up concatenated key with pipe delimiter #create field_key{field number} -> {field_key} = occurences foreach my $field_number (@field_selection){ $record_key .= '\|'.$fields[$field_number]; ${$field_key{$field_number}}{$fields[$field_number]}++; } #create #file_key{filename}->{concatenated record key} = full record ${$file_key{$file}}{$record_key} = $record; } close INFILE; } open (PROOF_FILE, ">$proof_file") \|\| die " cannot write to proof file $proof_file: $!\n"; my $built_key = undef; OPTIMISE: #Generic hash for value sorting hashes my %hash_value_to_sort = (); #keep track of smallest field tests my %smallest_field = (); #keep track of tried fields for multiple passes my %tried_field = (); #keep track of smallest fields that revert to #global_match my %exhausted = (); my $match_result = 0; #recurse built keys until match result == 1 while ($match_result == 0){ $built_key = &build_key(); $match_result = &check_key_match($built_key); } goto OPTIMISE; sub build_key{ my $still_keys_left = undef; my $appended_keys = undef; #cycle through field selection foreach my $field_number (@field_selection){ #when field keys exhausted #keep global until succesfull match #resets %exhausted if (exists $exhausted{$field_number}){ $appended_keys .= '\\|' . 'global_match'; next; } #get a key for each field and build up final key my $found_key = &get_field_key($field_number); print "$field_number:" . ${$found_key}{'key'} . "\n"; $appended_keys .= '\\|' . ${$found_key}{'key'}; #if field key returns global match then all #field keys have been exhausted. No need to #check for for smallest for this field anymore #so clear %smallest-field if it relates to this #field if (${$found_key}{'key'} =~ m/ global_match /xms){ $exhausted{$field_number} = '1'; if (exists $smallest_field{$field_number}){ %smallest_field = (); } } #otherwise this field still has keys left else{ $still_keys_left = 1; } #keep track of tried keys for this field incase #we have multiple passes ${$tried_field{$field_number}}{${$found_key}{'key'}} = 1; #don't bother with defining smallest once fields #exhausted go to next field if (exists $exhausted{$field_number}){ next ; } #1st definition #flag the field number and record number of #occurances that the key was found. Flag the field #as smallest. if (not defined $smallest_field{'number'}){ $smallest_field{'number'} = ${$found_key}{'number'}; $smallest_field{$field_number} = '1'; } #otherwise check current number of occurences for #this key and replace smallest if lower.Flag the #field as smallest. elsif (${$found_key}{'number'} < $smallest_field{'number'}){ $smallest_field{'number'} = ${$found_key}{'number'}; $smallest_field{$field_number} = '1'; } } #if no keys left to find, close the proof file and exit #the program if (not defined $still_keys_left){ close PROOF_FILE; exit; } #otherwise return the appended key return $appended_keys; } sub get_field_key{ #field we want to get a key for my $field = shift; #generic hash for value sorting %hash_value_to_sort = %{$field_key{$field}}; #cycle keys lowest number of occurrences first #this helps to optimise the record selection foreach my $key ( sort HashValueSort keys %hash_value_to_sort ) { #check if the field is flagged smallest occurence #only select next if key not already tried if (exists $smallest_field{$field}){ if (exists ${$tried_field{$field}}{$key}){ next; } } #return key and number of occurances return {'key' => $key, 'number' => $hash_value_to_sort{$key} }; } #if no valid key avaiable (i.e. all tried or all keys found) #return a global match for this field return {'key' => 'global_match', 'number' => '0' }; } sub check_key_match{ my $check_key = shift; #substitute with global pattern match $check_key =~ s/\\|global_match/\\|\.\+/g; #recurse through each file until a record key is #found in the file hash #print matching record from %file_key hash #delete matching keys from %field_key hash #delete matching record from $file_key hash foreach my $file (@file_list){ foreach my $key (keys %{$file_key{$file}}){ if ($key =~ m/\A $check_key \z/xms){ my $not_dupe = &delete_keys_found($key); if ($not_dupe == 1){ print PROOF_FILE ${$file_key{$file}}{$key} . "\n"; print STDOUT ${$file_key{$file}}{$key} . "\n"; delete ${$file_key{$file}}{$key}; } #flag match found return 1; } } } #flag no match found return 0; } sub delete_keys_found{ my $delete_keys = shift; my @delete_keys = split ('\\|', $delete_keys); my $found_key = 0; #make up any blank last field after split on '\|' while ($#delete_keys < $#field_selection){ push @delete_keys,''; } #ignore empty first index after split on '\|' my $fld_num = 1; foreach my $field (@field_selection){; if (exists ${$field_key{$field}}{$delete_keys[$fld_num]}){ delete ${$field_key{$field}}{$delete_keys[$fld_num]}; $found_key = 1; } $fld_num++; } #if no keys found to delete then all were dupes #so flag and do not print record return $found_key; } sub HashValueSort{ $hash_value_to_sort{$b} <=> $hash_value_to_sort{$a}; } [download]	[reply] [d/l] [select]

In Section Seekers of Perl Wisdom