Pragmatically speaking, you will have to approach tasks like this one in several very-distinct “layers” ...
- The first step is to get all of the data from any spreadsheet-file into a common data store ... e.g. an SQL database (SQLite file?). Grab it exactly as-is, and arrange this data-intake script so that you are able to verify (from the database entries) that all of the available spreadsheets have in fact been imported ... when, by whom, and so on. If you re-import a file that has already been previously imported, all of the preceding data should be cleanly replaced. After all, the greatest threat to the data-integrity of the entire study is that data is missing, or that it is duplicated.
The next step is standardization: without altering the original “capture” data, this step converts apples to consistent oranges. This process, once again, must be entirely reproducible. It should create new, standardized data-tables from the data-capture originals. If any of the input data does not conform to whatever validation rules you can come up with, it should be very-clearly flagged as non-conforming.
The final step is ... whatever your analysis needs to be. This step will rely very heavily upon all of the preceding steps to have delivered a data-set that is both complete and consistent, and/or to have clearly “blown the whistle” if something is wrong ... even if (especially if?) the source of the inconsistency is “the work of an experimenter.” Always bear in mind that “only the computer itself” can be relied-upon to detect omissions or inconsistencies in a mass of collected data. The scripts that comprise your pipeline must be not ony reliable but error-aware.
You can certainly use Perl for each of these steps. (In a Windows environment, yes, Perl does OLE...) Unfortunately, the exact nature of what needs to be built, and of how to correctly use what has been built, will be completely determined by what you need to do in this project.
Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
Read Where should I post X? if you're not absolutely sure you're posting in the right place.
Please read these before you post! —
Posts may use any of the Perl Monks Approved HTML tags:
You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
- a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
Link using PerlMonks shortcuts! What shortcuts can I use for linking?
See Writeup Formatting Tips and other pages linked from there for more info.
| & || & |
| < || < |
| > || > |
| [ || [ |
| ] || ] ||