http://www.perlmonks.org?node_id=1229925

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing a Perl module to try to autodetect whether a spreadsheet/csv file has a header row. It's a somewhat tricky problem, especially when trying to factor in the kind of malformed data people might feed in. I'm sticking with simple cases for now. I'd like to try to do some statistical analysis of the data to help me. Unfortunately, my knowledge of statistics is very weak, so I'm feeling my way in the dark. So take this sample column for example:

STATE NY NY NY NJ NY

It's obviously a column of states. One thing that might jump out to a computer is that the first value is 5 letters long while the rest of the values are 2 letters. Things can of course get fuzzier: the first value might have 5 letters while the rest of the rows have 2 or 3 letters. Other tell-tale signs might be that the header cell is a string while the rest of the column is full of numbers, or that the header is named "STATUS" while the data contains only the words "ACTIVE" or "RETIRED." The size of the column is also an important factor: if there is a lot of data, any statistical approach will likely be more accurate.

The Math::NumberCruncher module has some useful functions I think I could use, like standard deviation, to help me analyze things like how diverse the dataset is for a certain property (length, number of unique values, etc.). But I'm not really sure how I might apply it in a practical way. I found this interesting article, but there's no code, and not being familiar with statistics, I'm still not clear on exactly how that analysis was done.
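To make that concrete, here is the sort of thing I imagine, as a very rough, untested sketch in plain Perl (using List::Util, and treating cell length as the property being measured):

use strict;
use warnings;
use List::Util qw(sum0);

# How far (in standard deviations) is the first cell's length from the
# mean length of the remaining cells? Just an illustration of the idea.
sub length_deviation_of_first_cell {
    my ($first, @rest) = @_;
    my @lengths = map { length } @rest;
    my $mean    = sum0(@lengths) / @lengths;
    my $sd      = sqrt( sum0( map { ($_ - $mean) ** 2 } @lengths ) / @lengths );
    my $diff    = abs(length($first) - $mean);
    return 0       if $diff == 0;
    return 9**9**9 if $sd == 0;    # data lengths all identical but the first cell differs
    return $diff / $sd;
}

# 'STATE' is 5 characters, the data cells are all 2, so this reports a huge deviation
my $dev = length_deviation_of_first_cell(qw(STATE NY NY NY NJ NY));
print "first cell is $dev standard deviations from the data lengths\n";

The same pattern could presumably be repeated for other properties (number of digits, number of unique values, and so on).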

I intend to analyze each column and try to come up with some kind of "likely has header" factor based on the analysis. If it looks like most columns have a header, then it will determine that the spreadsheet has a header row.

Sorry this question is so open-ended. But any tips/advice you can think of would be appreciated.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;

Re: Useful heuristics for analyzing arrays of data to determine column header
by hdb (Monsignor) on Feb 15, 2019 at 07:29 UTC

    This is a very interesting endeavour! Here are my two cents:

    • If the first row has a string and everything else is numbers, the column has a header. Scalar::Util::looks_like_number could be useful.
    • If the first row has a number, it is not likely to be a header.
    • If the first row is a string but its value repeats further down the column, it is not likely to be a header.
    • If the value in the first row is unique while other values appear multiple times, it is likely a header. This should be easy to implement.
    • I would assign some likelihood to each column. If the average is above a threshold, or one or more columns are certain to have a header, the first row is a header row (a rough sketch of this follows below).
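    A rough, untested sketch of what such a per-column score could look like (the weights and rules here are just placeholders, not a recommendation):

    use strict;
    use warnings;
    use List::Util qw(sum0);
    use Scalar::Util qw(looks_like_number);

    # Toy score for one column (array ref, first element = candidate header):
    # returns a value in [0,1] guessing how likely the first cell is a header.
    sub likely_header {
        my ($first, @rest) = @{ $_[0] };
        my $score = 0.5;                                   # start undecided

        my $numeric_rest  = grep { looks_like_number($_) } @rest;
        my $first_repeats = grep { $_ eq $first } @rest;
        my %seen          = map  { $_ => 1 } @rest;
        my $data_repeats  = keys(%seen) < @rest;

        $score += 0.4 if !looks_like_number($first) && $numeric_rest == @rest; # string over numbers
        $score -= 0.4 if looks_like_number($first);                            # numeric first cell
        $score -= 0.3 if $first_repeats;                                       # first value repeats below
        $score += 0.2 if !$first_repeats && $data_repeats;                     # unique header over repeating data

        $score = 0 if $score < 0;
        $score = 1 if $score > 1;
        return $score;
    }

    # Average the per-column scores; above some threshold, call the first row a header.
    my @columns = ( [qw(STATE NY NY NY NJ NY)], [qw(AGE 34 22 57 41 19)] );
    my @scores  = map { likely_header($_) } @columns;
    printf "average header likelihood: %.2f\n", sum0(@scores) / @scores;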

      Hi hdb,

      these are very interesting ideas, but I'm not really convinced by this one:

      If the first row has a number, it is not likely to be a header.
      The header could consist of years, month numbers, quarters, test IDs, etc., all appearing to be numerical.

      @ nysus: in general, a very strong principle is "know your data." Do you know anything about the data you're going to deal with, or is this just a general-purpose tool where you can't know in advance anything about the type of your data?

        Most of the data I'm dealing with will be related to people's contact info. But I'm also interested in trying to write a general purpose tool that can be used by others just for the challenge.

Re: Useful heuristics for analyzing arrays of data to determine column header
by sectokia (Pilgrim) on Feb 15, 2019 at 10:24 UTC

    For each cell in the table, count the number of items in the same column with that value, and divide by the total number of items in the column. Call this the cell's 'value cardinality'.

    Next, determine a type for every cell (int, float, string, path, filename, whatever you think you are likely to find). As before, count the number of cells in the same column with the same type, and divide by the total number of items in the column. This gives the cell's 'type cardinality'.

    Next, for each row, multiply all the value cardinalities in that row, giving you a total value cardinality for that row. Do the same for type cardinality.

    Finally, check whether either the value or type cardinality of the first row is an outlier compared to all the other rows. This can be a simple check, such as either of the values being the lowest, or a more complicated one (such as not only being the lowest value but also being at least 1 standard deviation away from the nearest value in the distribution of cardinality values for the other rows).

    So if you have a table with headers like "Firstname","Lastname","Age", then the type cardinality will trigger because of the type difference in the Age column. And if you have a table like "Firstname","Lastname","Sex", then the value cardinality will trigger.

    The hard situation is when you have a header like "Firstname,Middlename,Surname". This is where your success depends entirely on how much effort you put into type identification - because that is how a human does it.
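    A minimal sketch of that scheme (untested; the type detector only distinguishes numbers from strings, and the "outlier" test is simply whether no other row has a lower product):

    use strict;
    use warnings;
    use Scalar::Util qw(looks_like_number);

    # A table as an array of rows, each row an array ref of cells.
    my @table = (
        [ 'Firstname', 'Lastname', 'Age' ],
        [ 'Alice',     'Smith',    34    ],
        [ 'Bob',       'Jones',    41    ],
        [ 'Carol',     'Smith',    29    ],
    );

    # Crude type detector; a real one would also know dates, paths, etc.
    sub cell_type { looks_like_number($_[0]) ? 'num' : 'str' }

    my $rows = @table;
    my $cols = @{ $table[0] };

    # Per-column frequency of each value and of each type.
    my (@value_freq, @type_freq);
    for my $c (0 .. $cols - 1) {
        for my $r (0 .. $rows - 1) {
            my $cell = $table[$r][$c];
            $value_freq[$c]{$cell}++;
            $type_freq[$c]{ cell_type($cell) }++;
        }
    }

    # Per-row products of the value and type cardinalities.
    my (@row_value_card, @row_type_card);
    for my $r (0 .. $rows - 1) {
        my ($v, $t) = (1, 1);
        for my $c (0 .. $cols - 1) {
            my $cell = $table[$r][$c];
            $v *= $value_freq[$c]{$cell} / $rows;
            $t *= $type_freq[$c]{ cell_type($cell) } / $rows;
        }
        ($row_value_card[$r], $row_type_card[$r]) = ($v, $t);
    }

    # Does any other row score strictly lower than the first row?
    my $value_says_header = !grep { $row_value_card[$_] < $row_value_card[0] } 1 .. $rows - 1;
    my $type_says_header  = !grep { $row_type_card[$_]  < $row_type_card[0]  } 1 .. $rows - 1;
    print "first row looks like a header\n" if $value_says_header || $type_says_header;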

      Thanks. I will try this out.

      And here is some simple code I have so far. It just loops through the column and collects characteristics of the column into attributes that might be useful. For example, one thing it does is determine whether the column is sorted. If it is, and the header is not in the proper sort order, it sets a "has_sort_mismatch" flag. One thing the code accounts for is that the header might span multiple rows anywhere within the first 5 rows of the spreadsheet. That makes things slightly trickier.

      package Data::Column;

      use Moose;
      use Scalar::Util qw(looks_like_number);
      use Log::Log4perl::Shortcuts qw(:all);
      use namespace::autoclean;

      has 'first_cell'            => (is => 'ro', isa => 'Defined', lazy => 1, default => 0, writer => '_set_first_cell');
      has 'blank_count'           => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_blank_count');
      has 'mixed_case_count'      => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_mixed_case_count');
      has 'upper_case_count'      => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_upper_case_count');
      has 'lower_case_count'      => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_lower_case_count');
      has 'preceding_blank_count' => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_preceding_blank_count');
      has 'following_blank_count' => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_following_blank_count');
      has 'unique_count'          => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_unique_count');
      has 'strings_count'         => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_strings_count');
      has 'min_length'            => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_min_length');
      has 'max_length'            => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_max_length');
      has 'numbers_count'         => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_numbers_count');
      has 'percent_blank'         => (is => 'ro', isa => 'Num',  lazy => 1, default => 0, writer => '_set_percent_blank');
      has 'percent_strings'       => (is => 'ro', isa => 'Num',  lazy => 1, default => 0, writer => '_set_percent_strings');
      has 'percent_numbers'       => (is => 'ro', isa => 'Num',  lazy => 1, default => 0, writer => '_set_percent_numbers');
      has 'non_blank_count'       => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_non_blank_count');
      has 'spaces_count'          => (is => 'ro', isa => 'Int',  lazy => 1, default => 0, writer => '_set_spaces_count');
      has 'is_all_strings'        => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_is_all_strings');
      has 'has_sort_mismatch'     => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_has_sort_mismatch');
      has 'is_all_numbers'        => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_is_all_numbers');
      has 'is_pure'               => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_is_pure');
      has 'is_sorted'             => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_is_sorted');
      has 'is_mixed'              => (is => 'ro', isa => 'Bool', lazy => 1, default => 0, writer => '_set_is_mixed');
      has 'first_five_rows'       => (is => 'ro', isa => 'ArrayRef[Int]', lazy => 1, default => sub { [] }, writer => '_set_first_five_rows');
      has 'data' => (
          traits   => ['Array'],
          is       => 'ro',
          required => 1,
          isa      => 'ArrayRef',
          default  => sub { [] },
          handles  => {
              count    => 'count',
              is_empty => 'is_empty',
              elements => 'elements',
              get_data => 'get',
          },
      );

      sub BUILD {
          my $s = shift;

          my ($blank_count, $string_count, $number_count, $strings_with_spaces) = (0) x 4;
          my $last_cell;
          my $count = 0;
          my %uniques = ();
          my $sort_order = '';
          my $current_sort_order;
          my $is_sorted = 1;
          my $min_length = 0;
          my $max_length = 0;
          my $first_non_blank_data_cell = 0;
          my @first_five_rows = ();
          my $preceding_blank_row_count = 0;
          my $following_blank_row_count = 0;
          my $lower_case_count = 0;
          my $upper_case_count = 0;
          my $mixed_case_count = 0;

          # First pass: gather per-cell characteristics of the column
          foreach my $cell ($s->elements) {
              $count++;
              my $is_string = 0;
              if (!$cell) {
                  $blank_count++;
                  push @first_five_rows, 0 if $count < 6;
                  next;
              }
              if (!$first_non_blank_data_cell && $count > 5) {
                  # remember the value of the first non-blank data cell below the possible header block
                  $first_non_blank_data_cell = $cell;
              }
              push @first_five_rows, 1 if $count < 6;
              if (length $cell < $min_length || !$min_length) {
                  $min_length = length $cell;
              }
              if (length $cell > $max_length || !$max_length) {
                  $max_length = length $cell;
              }
              $uniques{$cell} = 1;
              if (!looks_like_number($cell)) {
                  $string_count++;
                  $is_string = 1;
                  my $has_lower = $cell =~ /[a-z]/;
                  my $has_upper = $cell =~ /[A-Z]/;
                  $mixed_case_count++ if $has_lower && $has_upper;
                  $lower_case_count++ if $has_lower && !$has_upper;
                  $upper_case_count++ if $has_upper && !$has_lower;
              }
              elsif ($cell) {
                  $number_count++;
              }
              if ($cell =~ / /) {
                  $strings_with_spaces++;
              }

              # Track whether the data rows (below the possible header block) are sorted
              if ($is_sorted && $count > 4 && $last_cell && $cell ne $last_cell) {
                  if ($is_string) {
                      $current_sort_order = $cell gt $last_cell ? 'asc' : 'desc';
                  }
                  else {
                      $current_sort_order = $cell > $last_cell ? 'asc' : 'desc';
                  }
                  $sort_order = $current_sort_order if !$sort_order;
                  if ($sort_order && $sort_order ne $current_sort_order) {
                      $is_sorted = 0;
                  }
              }
              elsif ($count > 4) {
                  $last_cell = $cell;
              }
          }

          my $non_blank_count = $s->count - $blank_count;
          if ($non_blank_count == $string_count && $string_count) {
              $s->_set_is_all_strings(1);
              $s->_set_is_pure(1);
          }
          elsif ($non_blank_count == $number_count && $number_count) {
              $s->_set_is_all_numbers(1);
              $s->_set_is_pure(1);
          }
          else {
              $s->_set_is_mixed(1);
          }

          # Locate the first non-blank row and count blank rows after it (first five rows only)
          my $first_non_blank_row = 0;
          my $row_count = 0;
          foreach my $row (@first_five_rows) {
              if ($row) {
                  $first_non_blank_row = $row_count;
                  last;
              }
              $row_count++;
          }
          foreach my $row (@first_five_rows[$row_count + 1 .. 4]) {
              if ($row && $following_blank_row_count) {
                  last;
              }
              if (!$row) {
                  $following_blank_row_count++;
              }
          }

          my $sort_mismatch = 0;
          my $first_cell = $s->get_data($first_non_blank_row);
          my $first_cell_is_all_upper = 0;
          my $first_cell_is_all_lower = 0;
          my $first_cell_is_mixed = 0;
          if (!looks_like_number($first_cell)) {
              my $first_cell_has_lower = $first_cell =~ /[a-z]/;
              my $first_cell_has_upper = $first_cell =~ /[A-Z]/;
              $first_cell_is_mixed = 1 if $first_cell_has_lower && $first_cell_has_upper;
              if (!$first_cell_is_mixed) {
                  if ($first_cell =~ /[a-z]/) {
                      $first_cell_is_all_lower = 1;
                  }
                  elsif ($first_cell =~ /[A-Z]/) {
                      $first_cell_is_all_upper = 1;
                  }
                  else {
                      $first_cell_is_mixed = 1;
                  }
              }
          }

          # If the data rows are sorted, flag a header candidate that breaks the sort order
          if ($is_sorted) {
              my $first_cell_is_greater;
              if (looks_like_number $first_cell) {
                  $first_cell_is_greater = $first_cell > $first_non_blank_data_cell;
              }
              else {
                  $first_cell_is_greater = $first_cell gt $first_non_blank_data_cell;
              }
              logd($first_cell_is_greater);
              if ($first_cell_is_greater && $sort_order eq 'asc') {
                  logd('hi');
                  $sort_mismatch = 1;
              }
              if (!$first_cell_is_greater && $sort_order eq 'desc') {
                  $sort_mismatch = 1;
              }
          }

          my $non_blank = $s->count - $blank_count;
          $s->_set_non_blank_count($non_blank);
          $s->_set_has_sort_mismatch($sort_mismatch);
          $s->_set_mixed_case_count($mixed_case_count);
          $s->_set_upper_case_count($upper_case_count);
          $s->_set_lower_case_count($lower_case_count);
          $s->_set_percent_blank($blank_count / $s->count * 100);
          $s->_set_percent_strings($string_count / $non_blank * 100);
          $s->_set_percent_numbers($number_count / $non_blank * 100);
          $s->_set_spaces_count($strings_with_spaces);
          $s->_set_blank_count($blank_count);
          $s->_set_max_length($max_length);
          $s->_set_min_length($min_length);
          $s->_set_strings_count($string_count);
          $s->_set_numbers_count($number_count);
          $s->_set_unique_count(scalar keys %uniques);
          $s->_set_preceding_blank_count($first_non_blank_row);
          $s->_set_following_blank_count($following_blank_row_count);
          $s->_set_is_sorted($is_sorted) if $s->count > 6;
          $s->_set_first_five_rows(\@first_five_rows);
          $s->_set_first_cell($first_cell);
      }

      sub has_blanks {
          my $s = shift;
          return $s->blank_count > 0;
      }

      I'm documenting some more thoughts on this:

      Once each column is analyzed, I can get an overall probability that a particular row is a header row by multiplying the probabilities that each column in isolation is a header. So: probability_col1_is_header * probability_col2_is_header * probability_col3_is_header, etc. When, or if, I get to a row that has a significantly lower overall probability than the previous rows, I can be pretty sure that that row starts the data and that the previous row or rows were headers.
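      In code, I picture something roughly like this (untested; column_header_probability() is just a stand-in for whatever per-column analysis I end up with):

      use strict;
      use warnings;

      # Stand-in: probability that row $r of column $c is a header cell.
      # In the real module this would come from the per-column analysis.
      sub column_header_probability {
          my ($c, $r) = @_;
          return $r == 0 ? 0.9 : 0.2;          # fake numbers for illustration
      }

      my @columns = (0 .. 2);                  # say, three columns
      my @row_scores;
      for my $r (0 .. 4) {                     # only the first five rows are candidates
          my $p = 1;
          $p *= column_header_probability($_, $r) for @columns;
          push @row_scores, $p;
      }

      # Walk down until the combined score drops sharply; rows above that point are headers.
      my $header_rows = 0;
      for my $r (1 .. $#row_scores) {
          if ($row_scores[$r] < 0.5 * $row_scores[$r - 1]) {   # "significantly lower" threshold, arbitrary
              $header_rows = $r;
              last;
          }
      }
      print "the first $header_rows row(s) look like headers\n";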

      Making some progress on the module. So here's some sample data for a column with the raw count and cardinality value for each unique value in the column:

      $VAR1 = {
          'ACTIVE' => {
              'count'      => 1941,
              'value_card' => '0.631630328669053'
          },
          'INACTIVE' => {
              'count'      => 717,
              'value_card' => '0.233322486169867'
          },
          'RETIRED' => {
              'count'      => 414,
              'value_card' => '0.134721770257078'
          },
          'STATUS' => {
              'count'      => 1,
              'value_card' => '0.000325414904002603'
          }
      };

      So in this simple case, the 'STATUS' value is unique to this column and is clearly an outlier from the other three possible values. But in fuzzier situations, how would I determine whether 'STATUS' is "1 standard deviation" away from the other value cardinality values?

        how would I determine whether 'STATUS' is "1 standard deviation" away from the other value cardinality values? 

        Find the mean (m) and standard deviation (s) over all the "value_card" values from that column (or the "value_card" values combined over a single row). The distance of "STATUS"'s "value_card" from another entry's value_card, say 'ACTIVE', is their absolute difference. Dividing that difference by the standard deviation gives the distance in terms of standard deviations, which tells you how many standard deviations apart they are.
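        With the numbers from your dump, that calculation would look something like this (plain Perl, untested):

        use strict;
        use warnings;
        use List::Util qw(sum0);

        my %value_card = (
            ACTIVE   => 0.631630328669053,
            INACTIVE => 0.233322486169867,
            RETIRED  => 0.134721770257078,
            STATUS   => 0.000325414904002603,
        );

        my @vals = values %value_card;
        my $mean = sum0(@vals) / @vals;
        my $sd   = sqrt( sum0( map { ($_ - $mean) ** 2 } @vals ) / @vals );

        # absolute difference between STATUS and ACTIVE, expressed in standard deviations
        my $dist = abs( $value_card{STATUS} - $value_card{ACTIVE} );
        printf "STATUS is %.2f standard deviations away from ACTIVE\n", $dist / $sd;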

        But this gets less and less trustworthy as the number of value_cards gets smaller. For example, this online outlier tool sees 1 outlier, but it is not what we think it is!!! https://www.graphpad.com/quickcalcs/grubbs2/ (got it from How to best eliminate values in a list that are outliers, which may be relevant to your problem)

        bw, bliako

        Headers are usually in the first line, so if that single "STATUS" is on a line other than the first, it isn't a header...
Re: Useful heuristics for analyzing arrays of data to determine column header
by karlgoethebier (Abbot) on Feb 15, 2019 at 20:05 UTC
    ...autodetect...malformed data...try...

    Skip the first line. Insist on sound data/specs. Don’t waste your time. Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfbish")->decrypt_hex($ENV{KARL});'

      I agree with you, but ... if the data has no clean header (which is the simplest task in data collection), then I suspect some of the rows will not be clean either. For example, there might be missing fields in rows, truncated or overflowed fields (because of some weird instrument error, network error, or conversion error), or formally correct fields that are qualitatively wrong (I mean an uncalibrated instrument, etc.). So, if (yet another) outlier detection module comes out of this, provided nysus has the time, then let it be, I say.