Useful heuristics for analyzing arrays of data to determine column header
by nysus (Vicar)
on Feb 14, 2019 at 23:06 UTC
nysus has asked for the wisdom of the Perl Monks concerning the following question:
I'm writing a Perl module to autodetect whether a spreadsheet/CSV file has a header row. It's a somewhat tricky problem, especially when factoring in the kind of malformed data people might feed in. I'm sticking with simple cases for now. I'd like to do some statistical analysis of the data to help me. Unfortunately, my knowledge of statistics is very weak, and I'm feeling my way in the dark. So take this sample column for example:
It's obviously a column of states. One thing that might jump out to a computer is that the first row is 5 letters long while the rest of the rows are 2 letters long. Things can of course get fuzzier: the first row might have 5 letters while the rest of the rows have 2 OR 3 letters. Other tell-tale signs might be that the header cell is a string while the rest of the column is full of numbers, or that the header cell is "STATUS" while the data contains only the words "ACTIVE" or "RETIRED". The size of the column is also an important factor: the more data there is, the more accurate any statistical approach is likely to be.
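To make those tell-tale signs concrete, here is a minimal sketch of how they might be turned into yes/no signals for a single column. The helper name column_signals, the signal names, and the cutoff of 5 distinct values for a "small set" are all made up for illustration; only List::Util and Scalar::Util from core Perl are used.

```perl
use strict;
use warnings;
use List::Util qw(min max);
use Scalar::Util qw(looks_like_number);

# Hypothetical helper (not from any module): $col is an arrayref of cell
# strings, and $col->[0] is the candidate header cell.
sub column_signals {
    my ($col) = @_;
    my ($first, @rest) = @$col;
    return {} unless @rest;    # nothing to compare the first cell against

    my @lens    = map { length($_) } @rest;
    my %seen    = map { $_ => 1 } @rest;
    my $numeric = grep { looks_like_number($_) } @rest;
    my $flen    = length($first);

    return {
        # First cell is shorter or longer than every data cell.
        length_outside_range => ($flen < min(@lens) || $flen > max(@lens)) ? 1 : 0,
        # First cell is text while every data cell parses as a number.
        text_over_numbers => (!looks_like_number($first) && $numeric == @rest) ? 1 : 0,
        # Data repeats a few distinct values (assumed cutoff: 5) and the
        # first cell is not one of them, as in STATUS vs ACTIVE/RETIRED.
        absent_from_small_set =>
            (keys(%seen) <= 5 && keys(%seen) < @rest && !$seen{$first}) ? 1 : 0,
    };
}
```

Each signal is deliberately crude on its own. The idea is that several weak signals pointing the same way are more convincing than any single one.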
The Math::NumberCruncher module has some useful functions I think I could use, like standard deviation, to help me analyze how diverse the dataset is for a certain property (length, number of unique values, etc.). But I'm not really sure how to apply it in a practical way. I found this interesting article useful, but there's no code and, not being familiar with statistics, I'm still not clear on exactly how that analysis was done.
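One way standard deviation might be applied in practice is to ask how many standard deviations the first cell's length sits from the average length of the other cells. The sketch below computes the mean and sample standard deviation by hand with List::Util rather than calling Math::NumberCruncher, so it makes no claims about that module's interface. The helper name first_row_length_zscore and the 0.5 floor on the deviation are assumptions for illustration, and this is not the method from the linked article.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: distance of the first cell's length from the mean
# length of the remaining cells, measured in standard deviations.
sub first_row_length_zscore {
    my ($col) = @_;
    my ($first, @rest) = @$col;
    return 0 if @rest < 2;    # too little data to estimate spread

    my @lens = map { length($_) } @rest;
    my $mean = sum(@lens) / @lens;
    # Sample variance (divide by n - 1), then standard deviation.
    my $var = sum(map { ($_ - $mean) ** 2 } @lens) / (@lens - 1);
    my $sd  = sqrt($var);

    # Assumption: floor the deviation at half a character so a column of
    # identical lengths (sd of 0) does not cause a division by zero.
    $sd = 0.5 if $sd < 0.5;

    return abs(length($first) - $mean) / $sd;
}
```

A large score (say, above 2 or 3) suggests the first cell is unusual compared with the data below it. Where to draw that line is a judgment call, and the score becomes more trustworthy as the column gets longer.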
I intend to analyze each column and try to come up with some kind of "likely has header" factor based on the analysis. If most columns look like they have a header, the module will conclude that the spreadsheet has a header row.
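The per-column vote described here could be sketched roughly as follows. The helper name sheet_has_header and the simple-majority rule are assumptions for illustration; the per-column test is passed in as a coderef so any heuristic can be plugged in.

```perl
use strict;
use warnings;

# Hypothetical helper: $rows is an arrayref of row arrayrefs, and
# $is_header_like is a coderef that takes one column (an arrayref whose
# first element is the candidate header) and returns true or false.
sub sheet_has_header {
    my ($rows, $is_header_like) = @_;
    my $ncols = @{ $rows->[0] // [] };    # width taken from the first row
    return 0 unless $ncols;

    my $votes = 0;
    for my $c (0 .. $ncols - 1) {
        # Build column $c; short rows contribute an empty string.
        my @col = map { $_->[$c] // '' } @$rows;
        $votes++ if $is_header_like->(\@col);
    }

    # Assumed rule: a strict majority of columns must look headed.
    return $votes > $ncols / 2 ? 1 : 0;
}
```

A weighted sum of per-column scores might work better than a plain head count, but a majority vote is the simplest reading of "most columns have a header".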
Sorry this question is so open-ended. But any tips/advice you can think of would be appreciated.
$PM = "Perl Monk's";