Re: Strategy for simple data management

Pragmatically speaking, you will have to approach tasks like this one in several very-distinct “layers” ...

The first step is to get all of the data from any spreadsheet-file into a common data store ... e.g. an SQL database (SQLite file?). Grab it exactly as-is, and arrange this data-intake script so that you are able to verify (from the database entries) that all of the available spreadsheets have in fact been imported ... when, by whom, and so on. If you re-import a file that has already been previously imported, all of the preceding data should be cleanly replaced. After all, the greatest threat to the data-integrity of the entire study is that data is missing, or that it is duplicated.
The next step is standardization: without altering the original “capture” data, this step converts apples to consistent oranges. This process, once again, must be entirely reproducible. It should create new, standardized data-tables from the data-capture originals. If any of the input data does not conform to whatever validation rules you can come up with, it should be very-clearly flagged as non-conforming.
The final step is ... whatever your analysis needs to be. This step will rely very heavily upon all of the preceding steps to have delivered a data-set that is both complete and consistent, and/or to have clearly “blown the whistle” if something is wrong ... even if (especially if?) the source of the inconsistency is “the work of an experimenter.” Always bear in mind that “only the computer itself” can be relied-upon to detect omissions or inconsistencies in a mass of collected data. The scripts that comprise your pipeline must be not ony reliable but error-aware.

You can certainly use Perl for each of these steps. (In a Windows environment, yes, Perl does OLE...) Unfortunately, the exact nature of what needs to be built, and of how to correctly use what has been built, will be completely determined by what you need to do in this project.


Perl: the Markov chain saw
	PerlMonks