aartist has asked for the
wisdom of the Perl Monks concerning the following question:
I have a CSV file and has 100+ columns and 10,000+ lines. I like to find unique key from the data. Key is formed by mixing multiple columns. For example columns 1,2,3,12,14 could form a unique key for the given data. How would I find such thing? I can take all the columns and that will make it unique, but I like to do with minimum number of columns possible.
Thanks.
Re: Finding Keys from data by wind (Priest) on Mar 31, 2011 at 15:38 UTC |
Could always just use a foreign key like the line number ($.)? That would be unique for yours or any given csv file. We aren't really going to be able to advise you on how to refer to your data more specifically without knowing anything about it.
One small note is that be sure to check out Text::CSV and Text::CSV_XS for processing of csv files.
| [reply] [d/l] |
|
Sorry, I cannot use additional columns such as line numbers. That will not serve the purpose. I have to find the keys from existing data and take these keys and compare with some other data, which has additional columns and rows.
| [reply] |
Re: Finding Keys from data by jethro (Monsignor) on Mar 31, 2011 at 15:46 UTC |
Finding an optimal solution may not be possible (without searching the complete space of 2^100 possibilities) but you might try something like this:
1) For each column count the number of different values in it (easily done with hashes)
2) Sort that column list lowest number first
3) Set a key of all columns
3) Foreach column on that list try to remove it from the key and see if two lines have the same key now. if yes, re-add the column to your key, otherwise leave it off
This will give you exactly one solution. If you want to look for better solutions, you might rerun the algorithm, but scramble the sorted list of columns somewhat to get at different solutions. You could even use a totally random column list, but most likely you will get too many suboptimal solutions presented that way
| [reply] |
|
jethro:
I like that approach, but rather than excluding columns one-by-one, I'd include them one-by-one, like this:
$ cat 896650.pl
#!/usr/bin/perl
use strict;
use warnings;
# Generate values
my @cols = ( 0 .. 10 );
my $rows=50000;
my @data = ( [ map { int 100*rand } @cols ] );
my @gen = ( map { [ 100*rand, int 100*(1+$_*$_*$_*rand) ] } @cols );
for my $r (1 .. $rows) {
for my $c ( @cols ) {
$data[$r][$c] = int( $data[$r-1][$c]
+ $gen[$c][0]*rand) % ($gen[$c][1]);
}
}
# Count distinct column values
my %H;
for (@data) {
my @v=@$_;
for my $c (@cols) {
$H{$c}{$v[$c]}++;
}
}
# Sort so most diverse comes first
my @colrank = sort { $$b[1] <=> $$a[1] }
map { [ $_, scalar keys %{$H{$_}} ] } @cols;
my @colOrder = map { $$_[0] } @colrank;
for my $r (@colrank) {
print "col $$r[0] has $$r[1] values\n";
}
print join(", ", @colOrder), "\n";
print scalar(@data), " rows to process.\n";
my @tmpdata = @data;
my @keycols = ( shift @colOrder );
while (keys %H < @tmpdata) {
print scalar(@tmpdata), " rows to process.\n";
%H=();
for (@tmpdata) {
my $key=join(".", @$_[@keycols]);
$H{$key}++;
}
print "cols(", join(",",@keycols), ") give ",
scalar(keys %H), " rows.\n";
push @keycols, shift @colOrder;
}
$ perl 896650.pl
col 10 has 37754 values
col 9 has 29806 values
col 8 has 29772 values
col 7 has 22835 values
col 6 has 16081 values
col 5 has 10136 values
col 3 has 2726 values
col 4 has 559 values
col 2 has 216 values
col 1 has 176 values
col 0 has 1 values
10, 9, 8, 7, 6, 5, 3, 4, 2, 1, 0
50001 rows to process.
50001 rows to process.
cols(10) give 37754 rows.
50001 rows to process.
cols(10,9) give 49996 rows.
50001 rows to process.
cols(10,9,8) give 50001 rows.
However, I'd wonder what the utility of it would be...
I was thinking that one possible improvement of my code would be to remove the "duplicate" rows after determining each key, then redo the column count so you could get the best remaining key from the list. It certainly wouldn't be the optimal solution, but I'd bet it gets pretty close.
...roboticus
When your only tool is a hammer, all problems look like your thumb. | [reply] [d/l] |
|
My reason to do it backwards was because I thought that checking for whether a key could be removed or not was easier than checking if a key should be added. Because in the first case you only have to check whether one key is duplicate now (easy to do with a hash. In the second case you have to check whether the addition of the key improves anything.
This is a big problem with your solution now. Say two columns have the same values, but with many different values so they are both high on the @colOrder list. They will both be added even though only one of them would improve the key quality, the other is completely useless.
You also talk about an improvement by removing "duplicate" rows. Did you mean "singular" rows? I.e. a row that is already singled out/identified completely by the key isn't important anymore for finding relevant keys and could be removed. This is one possible way to check for whether a key improves things.
One factor that also should be taking into account, if you want to determine whether adding or removing is better, is how many keys the solution likely needs. If the data is nearly as random as BrowserUKs test data, obviously the solution will only need a few columns. But if many columns are linked to each other and have only a few different values, the key needs many columns
Though the more columns the key needs the worse it is as a key ;-). A key that needs 50 of the 100 columns already uses half the space of the database itself
| [reply] |
|
Re: Finding Keys from data by moritz (Cardinal) on Mar 31, 2011 at 15:46 UTC |
If this is for managing database-like data, don't do it. Add an artificial primary key.
If you really need it, and your XY Problem has no better solution, you need brute force.
If you want a faster but approximate solution, read up on greedy algorithms.
| [reply] |
Re: Finding Keys from data by BrowserUk (Pope) on Mar 31, 2011 at 16:21 UTC |
#! perl -slw
use strict;
use Algorithm::Combinatorics qw[ variations ];
my %uniq;
while( <DATA> ) {
chomp;
my @fields = split ',';
OUTER: for my $k ( 1 .. $#fields ) {
my $iter = variations( \@fields, $k );
while( my $v = $iter->next ) {
my $key = join ':', @$v;
unless( exists $uniq{ $key } ) {
$uniq{ $key } = $.;
last OUTER;
}
}
}
}
my @lineorder = sort{ $uniq{ $a } <=> $uniq{ $b } } keys %uniq;
print "$uniq{ $_ } :: $_" for @lineorder;
__DATA__
2,6,5,2,1,1,7,9,8,6
5,8,5,0,9,3,8,9,0,2
2,3,1,1,5,2,9,7,8,3
0,1,3,7,6,2,4,3,7,5
4,6,2,8,6,4,1,5,4,3
4,6,7,2,0,9,6,5,0,9
5,6,2,4,3,7,1,9,3,5
2,5,7,1,0,0,0,5,8,5
3,8,1,4,9,2,5,8,1,0
5,2,2,2,0,7,2,8,3,1
7,1,2,6,5,4,0,9,2,5
1,6,3,7,3,8,7,0,7,7
0,0,8,9,9,8,3,3,6,0
0,2,5,3,8,4,1,8,9,4
5,6,9,0,6,4,9,5,0,7
9,0,9,3,2,6,3,2,4,6
3,3,0,4,8,5,7,7,2,4
3,1,3,0,0,3,1,7,3,8
0,6,7,0,8,9,4,8,4,8
0,2,0,3,7,4,6,8,4,5
Output (line number :: uniq key} c:\test>896650
1 :: 2
2 :: 5
3 :: 3
4 :: 0
5 :: 4
6 :: 6
7 :: 7
8 :: 1
9 :: 8
10 :: 5:2
11 :: 9
12 :: 1:6
13 :: 0:0
14 :: 0:2
15 :: 5:6
16 :: 9:0
17 :: 3:3
18 :: 3:1
19 :: 0:6
20 :: 0:3
But do realise that this is a massively combinatorial problem, and given the size of your data, if there is a high degree of similarity in your lines, it could take a very long time.
And, the results produced are in no way 'optimal'. That is to say, there is no attempt here to produce the set of shortest signatures. To do so would make this an NP hard problem of a massive scale.
By way of example: With 100 fields, there are ~10^157 variations of fields for each line. To find the optimal solution, you'd need to try (something like) 10^157! permutations. That number is so massive as to be impossible to calculate--never mind performing that number of trials--even if you had all the computers and storage in the world.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try
Or are you refering only to the worst case of your algorithm? Seems using permutation/variation might lead to much overhead in the search. If the combination 2:4 were already tested for example, testing 4:2 is not necessary anymore
| [reply] |
|
Why 10^257? A column is either in or not in the key. That would mean 2^100 possible solutions to try
For a unique key 'ab' is different to 'ba'.
So, it should be 100! + 99! + 98! + 97! ... so I rounded up 10^157.
To be honest, given the scale of the numbers, it makes little difference.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
|
Re: Finding Keys from data (where to start) by tye (Cardinal) on Mar 31, 2011 at 17:36 UTC |
My approach would be:
-
Find the most common value for each column, $mode[$col].
-
Count how many times that value occurs in its column, $freq[$col].
-
Assign each column the value, $entropy[$col]= $rows/$freq[$col].
-
Sort @entropy (descending).
-
Multiply the largest values of @entropy until you get to a value close to $rows.
-
Build a 'key' from the columns that contributed the entropy values that produced the near-$rows product.
-
Compute $mode and $freq for that key.
-
If $freq is 1, then drop the last column and try again, looking for a more minimal solution.
-
If $freq is more than 1, add the column that contributes the next-smaller value from @entropy.
After you've found the minimal solution using the initial ordering, you can look for "near by" solutions that use fewer columns by randomly removing a column (weighted somewhat toward the selected columns providing the least entropy) and, when required, randomly adding a new column (weighted somewhat toward the unselected columns providing the most entropy). Continue this until your solution seems adequate for the time you've spent searching for it.
You will likely end up building a list of several equally or nearly equally minimal solutions and working through that list trying to adjust each one so that it becomes more minimal. This is at least similar to an asexual "genetic" algorithm (I wouldn't bother "breeding" two solutions together). You'd build a population of "more successful" solutions were each generation adds several modified copies of each solution and culls the least successful of the new population (the most successful solutions get the most modified copies of themselves added). In that scenario, it would be worth caching previously culled potential solutions so you don't waste time recomputing their fitness.
If you're data is interestingly correlated, then you might want to use a different formula for $entropy. For example, if you have a couple of columns that have a pattern of values like "none,1", "none,2", "none,3",... "none,100", "a,0", "b,0",... "z,0"... then each column might have a entropy of only 2 but together they actually form a unique key.
In such a case, you should probably modify the computation of $entropy to take into account the number of unique values in a column, not just the frequency of the least unique value. The minimum entropy would be $rows/$freq and the maximum entropy would be $uniqs. I'd probably first try combining the two using an geometric mean: $entropy= sqrt($uniqs*$rows/$freq).
Although I describe "find the mode" and "find the frequency of the mode" as two separate steps, it is more likely that you'd perform a single step that would then give you $freq and $uniqs for a column ($mode is only useful in explaining the approach and isn't actually useful writing the code).
I also say "sort @entropy" but clearly you'd want to "sort { $entropy[$b] <=> $entropy[$a] } 0..$#entropy".
Update: There may also be cases where it is helpful to compute the actual combined "entropy" of pairs or small groups of columns and then add more successful groups together. However, the way I outlined above is going to be at least close to the fastest way to come up with a solution that has decent odds of not being among the worst of the solutions. Looking at pairs after that might allow refining to progress more quickly.
| [reply] [d/l] [select] |
Re: Finding Keys from data by BrowserUk (Pope) on Mar 31, 2011 at 18:56 UTC |
#! perl -slw
use strict;
use Data::Dump qw[ pp ];
use Algorithm::Combinatorics qw[ combinations ];
my @lines = map [ do{ chomp; split ',' } ], <DATA>;
my $w = $#{ $lines[0] };
my @keyFields;
WIDTH: for my $k ( 1 .. $w ) {
my $iter = combinations( [ 0 .. $w ], $k );
COMB: while( my $c = $iter->next ) {
my %uniq;
for my $a ( @lines ) {
next COMB if exists $uniq{ join ',', @{ $a }[ @$c ] };
$uniq{ join ',', @{ $a }[ @$c ] } = 1;
}
@keyFields = @$c;
last WIDTH;
}
}
print "All lines are unique using a combination of fields:[ @keyFields
+ ]";
__DATA__
2,6,5,2,1,1,7,9,8,6
5,8,5,0,9,3,8,9,0,2
2,3,1,1,5,2,9,7,8,3
0,1,3,7,6,2,4,3,7,5
4,6,2,8,6,4,1,5,4,3
4,6,7,2,0,9,6,5,0,9
5,6,2,4,3,7,1,9,3,5
2,5,7,1,0,0,0,5,8,5
3,8,1,4,9,2,5,8,1,0
5,2,2,2,0,7,2,8,3,1
7,1,2,6,5,4,0,9,2,5
1,6,3,7,3,8,7,0,7,7
0,0,8,9,9,8,3,3,6,0
0,2,5,3,8,4,1,8,9,4
5,6,9,0,6,4,9,5,0,7
9,0,9,3,2,6,3,2,4,6
3,3,0,4,8,5,7,7,2,4
3,1,3,0,0,3,1,7,3,8
0,6,7,0,8,9,4,8,4,8
0,2,0,3,7,4,6,8,4,5
Output: c:\test>896650-2
All lines are unique using a combination of fields:[ 3 6 ]
On a randomly generated dataset of the scale you've suggested, it takes around 30 seconds to complete: [19:42:44.71] c:\test>head -2 896650.dat
357,67,815,493,516,302,810,899,672,91,542,795,527,429,217,502,811,554,
+345,312,689,336,710,194,778,869,711,413,402,313,960,351,264,511,558,3
+01,184,414,955,528,44,839,786,363,629,599,611,623,488,534,948,140,489
+,474,511,662,217,665,58,930,879,683,764,203,863,384,509,5,612,903,0,3
+82,969,240,130,460,652,474,478,562,7,117,360,688,702,657,329,626,521,
+808,547,477,903,510,913,883,398,201,375,729
675,393,313,605,265,95,415,813,667,95,945,188,2,646,803,534,842,589,29
+9,934,395,429,34,733,684,412,799,463,896,130,896,505,530,669,793,750,
+527,865,514,55,25,659,668,780,892,119,532,163,707,132,85,841,327,314,
+408,284,714,166,344,425,165,998,843,272,612,671,873,772,134,665,331,7
+26,356,522,838,22,875,191,835,965,373,304,885,7,586,908,588,623,764,7
+02,4,656,794,289,18,872,639,495,513,35
[19:54:27.66] c:\test>wc -l 896650.dat
10000 896650.dat
[19:54:40.64] c:\test>896650-2 896650.dat
All lines are unique using a combination of fields:[ 0 1 2 ]
[19:55:10.04] c:\test>
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] [d/l] [select] |
|
|