I want to build a generic Perl module for handling and analysing biomedical character separated datasets and which can, most certain, be used on any kind of datasets that contain a mixture of categorical (A,B,C,..) and continuous (1.2,3,881..) and identifier (XXX1,XXX2...). The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place were the analysis reports should be placed and the structure of the data.
By structure of data I mean which variable is in which place and its name/type. And this is where I need some enlightenment. I am baffled how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format would be the cleanest but maybe not all people enjoy doing something like this.
The solutions I can think of are:
- Create a configuration file in XML or similar and with a prespecified format.
- Pass the information during initialization of the module
- Use the first row of the data as headers and try to guess types (ouch)
Surely there must be a "canonical" way of doing this that is also usable and efficient. It should also reflect the fact that these datsets change over time with new variables added or old ones deleted and that people might have a subset of the data at any time and not the entire thing.