Strategy for simple data management

by bojinlund (Parson)
on Jan 10, 2014 at 15:13 UTC ( #1070128=perlquestion )
bojinlund has asked for the wisdom of the Perl Monks concerning the following question:


I am working on a small research project with limited resources, where a lot of data (measured and calculated values) needs to be handled. The data come from scientific experiments and are usually first stored in Excel spreadsheets. One spreadsheet contains data from a few experiments. For each experiment there is typically some basic information (50 data items) and a number of time series of measured values (10 series, 100 points in time and 30 measured values for each time). There are about 100 old spreadsheets, and some hundred new ones will be created. The old spreadsheets are similar but not standardised.

Representation of a quantity

A quantity is a property that is measured. Examples: mass, length, time. A unit is a standard quantity against which a quantity is measured. Examples: gram, metre, second, which are units of the above quantities.

This can also be described by:

A = {A} * [A]
A is the symbol for the quantity, {A} symbolizes the numerical value of A, and [A] represents the corresponding unit. (e.g., A = 300 * m = 0.3 * km). {A} is often called the measured value.
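
A minimal Perl sketch of the relation A = {A} * [A], assuming a quantity is stored as a hash holding its numerical value and its unit. The conversion table %to_metre and the function convert_length are hypothetical names used only for illustration, and only a few length units are covered.

```perl
use strict;
use warnings;

# Hypothetical table: factor that converts each length unit to metres.
my %to_metre = ( m => 1, km => 1000, mm => 0.001 );

# A quantity A is stored as its numerical value {A} plus its unit [A].
# Converting changes both parts so that the quantity itself stays the same.
sub convert_length {
    my ( $quantity, $target_unit ) = @_;
    my $from = $to_metre{ $quantity->{unit} }
        // die "Unknown unit '$quantity->{unit}'\n";
    my $to = $to_metre{$target_unit}
        // die "Unknown unit '$target_unit'\n";
    return {
        value => $quantity->{value} * $from / $to,
        unit  => $target_unit,
    };
}

my $length = { value => 300, unit => 'm' };
my $in_km  = convert_length( $length, 'km' );
printf "%g %s = %g %s\n",
    $length->{value}, $length->{unit}, $in_km->{value}, $in_km->{unit};
```

The same idea extends to other quantities by keeping one conversion table per kind of quantity.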

Examples of things needed to describe a quantity are:
  • Value
  • Unit
  • Property measured/calculated (temperature of air)
  • Measuring/calculation method

Representation of unit

The unit must be represented in a consistent way. For units from the International System of Units (SI), the exponents of the base units can be used. SI has 7 base units (metre [m], kilogram [kg], second [s], … ). Derived units can be expressed using exponents of the base units (area [m^2], speed [m s^-1]).
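
One possible way to hold such a unit in Perl, sketched here under the assumption that a unit is a hash mapping each base-unit symbol to its exponent. The function names combine_units and unit_to_string are hypothetical; this is an illustration, not an existing module's API.

```perl
use strict;
use warnings;

# A unit is a hash of SI base-unit symbol => exponent.
# Base units that are absent have exponent 0.
my $metre  = { m => 1 };
my $second = { s => 1 };

# Multiplying two quantities adds the exponents of their units;
# dividing subtracts them (pass -1 as $sign).
sub combine_units {
    my ( $left, $right, $sign ) = @_;
    my %result = %$left;
    for my $base ( keys %$right ) {
        $result{$base} = ( $result{$base} // 0 ) + $sign * $right->{$base};
        delete $result{$base} if $result{$base} == 0;
    }
    return \%result;
}

# Render in sorted order so the same unit always gives the same string.
sub unit_to_string {
    my ($unit) = @_;
    return join ' ',
        map { $unit->{$_} == 1 ? $_ : $_ . '^' . $unit->{$_} }
        sort keys %$unit;
}

my $area  = combine_units( $metre, $metre,  1 );     # metre times metre
my $speed = combine_units( $metre, $second, -1 );    # metre per second
print unit_to_string($area),  "\n";
print unit_to_string($speed), "\n";
```

Because two units are equal exactly when their exponent hashes are equal, this form makes it straightforward to check that values being compared or added really have the same unit.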

Goals and restrictions

Primary goals are:
  • Store the data from the experiments in a consistent way
  • Make it possible for Perl programs to use the stored data
  • Make it possible for Perl programs to create new sets of stored data
  • Make the data available to JavaScript programs

MS Windows systems are used.

Design ideas

  • Use one text file to represent data from one experiment. The text files are stored in a directory structure in a normal file system.
  • For each spreadsheet there is one associated file, containing the additional information needed to create a “standardised” representation from the spreadsheet. A Perl script is used to create the “standardised” representation.
  • A temporary database is created for each purpose. (I do not think it is possible to have one on-line database with everything.) A Perl script loads the temporary databases from a number of “standardised” files.
  • A database implementation which can provide access to Perl and JavaScript programs is selected.
  • Make it possible from Perl to create new “standardised” files or augment existing ones. (Write to the database and then create “standardised” files with the new and updated data.)


  • Is there any similar Perl-based system already implemented?
  • Is there a better design strategy? What should be changed?
  • What type of database can be used? Is MongoDB possible and suitable?
  • What is a suitable format for the “standardised” files?
  • Are there Perl modules for handling SI units and converting between such units?

Replies are listed 'Best First'.
Re: Strategy for simple data management
by sundialsvc4 (Abbot) on Jan 10, 2014 at 23:25 UTC

    Pragmatically speaking, you will have to approach tasks like this one in several very-distinct “layers” ...

    1. The first step is to get all of the data from any spreadsheet-file into a common data store ... e.g. an SQL database (SQLite file?).   Grab it exactly as-is, and arrange this data-intake script so that you are able to verify (from the database entries) that all of the available spreadsheets have in fact been imported ... when, by whom, and so on.   If you re-import a file that has already been previously imported, all of the preceding data should be cleanly replaced.   After all, the greatest threat to the data-integrity of the entire study is that data is missing, or that it is duplicated.
    2. The next step is standardization:   without altering the original “capture” data, this step converts apples to consistent oranges.   This process, once again, must be entirely reproducible.   It should create new, standardized data-tables from the data-capture originals.   If any of the input data does not conform to whatever validation rules you can come up with, it should be very-clearly flagged as non-conforming.
    3. The final step is ... whatever your analysis needs to be.   This step will rely very heavily upon all of the preceding steps to have delivered a data-set that is both complete and consistent, and/or to have clearly “blown the whistle” if something is wrong ... even if (especially if?) the source of the inconsistency is “the work of an experimenter.”   Always bear in mind that “only the computer itself” can be relied-upon to detect omissions or inconsistencies in a mass of collected data.   The scripts that comprise your pipeline must be not only reliable but error-aware.
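
    The first two layers can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory hash (%captured) stands in for the SQL data store suggested above, and the names import_rows and standardise, the file name, and the validation rule are all hypothetical.

```perl
use strict;
use warnings;

my %captured;    # source file name => { imported_at => ..., rows => [...] }

# Step 1: store rows exactly as read. Re-importing a file replaces its
# earlier rows, so nothing ends up duplicated.
sub import_rows {
    my ( $source, $rows ) = @_;
    $captured{$source} = { imported_at => time, rows => [@$rows] };
    return;
}

# Step 2: build standardised copies without touching the captured rows.
# Rows that fail validation are flagged, not silently dropped.
sub standardise {
    my ($source) = @_;
    my @out;
    for my $row ( @{ $captured{$source}{rows} } ) {
        my %copy = %$row;
        # Hypothetical rule: a plain decimal value and a non-empty unit.
        my $ok = defined $copy{value}
            && $copy{value} =~ /\A-?\d+(?:\.\d+)?\z/
            && defined $copy{unit}
            && $copy{unit} ne '';
        $copy{conforming} = $ok ? 1 : 0;
        push @out, \%copy;
    }
    return \@out;
}

my @sheet = ( { value => '21.5', unit => 'degC' }, { value => 'n/a', unit => 'degC' } );
import_rows( 'sheet1.xls', \@sheet );
import_rows( 'sheet1.xls', \@sheet );    # re-import replaces, does not append

my $rows    = standardise('sheet1.xls');
my $flagged = grep { not $_->{conforming} } @$rows;
printf "%d rows, %d flagged\n", scalar @$rows, $flagged;
```

    With a real database, the same replace-on-re-import behaviour would typically be done by deleting the rows for that source file and inserting the new ones inside one transaction.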

    You can certainly use Perl for each of these steps.   (In a Windows environment, yes, Perl does OLE...)   Unfortunately, the exact nature of what needs to be built, and of how to correctly use what has been built, will be completely determined by what you need to do in this project.

Re: Strategy for simple data management
by basiliscos (Pilgrim) on Jan 10, 2014 at 18:11 UTC

    Very simple approach: just store your experiment files in JSON format, 1 file per experiment.

    And in each file, just an array of JSON hashes like

    [ { "value": "..", "unit": "..", ... } ]

    Of course, I am assuming that you will not need any data search/aggregation etc., especially if you are going to implement the analysis in pure JS.
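
    A minimal sketch of writing and reading one such file from Perl, using the core JSON::PP module. The function names save_experiment and load_experiment, the file name, and the "property" key are assumptions made for the example.

```perl
use strict;
use warnings;
use JSON::PP;
use File::Temp qw(tempdir);
use File::Spec;

# canonical sorts hash keys, so repeated saves of the same data produce
# identical files (convenient for diffs and version control).
my $json = JSON::PP->new->utf8->canonical->pretty;

sub save_experiment {
    my ( $path, $quantities ) = @_;
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} $json->encode($quantities);
    close $fh or die "Cannot close $path: $!";
    return;
}

sub load_experiment {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot read $path: $!";
    local $/;    # slurp the whole file
    my $text = <$fh>;
    close $fh;
    return $json->decode($text);
}

my $dir  = tempdir( CLEANUP => 1 );
my $file = File::Spec->catfile( $dir, 'experiment-001.json' );
save_experiment( $file,
    [ { value => 21.5, unit => 'degC', property => 'temperature of air' } ] );

my $data = load_experiment($file);
print "$data->[0]{value} $data->[0]{unit}\n";
```

    The same file can be read in JavaScript with JSON.parse, which covers the goal of sharing the data between Perl and JavaScript programs.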

Re: Strategy for simple data management
by djerius (Beadle) on Jan 15, 2014 at 15:42 UTC
Re: Strategy for simple data management
by tangent (Priest) on Jan 17, 2014 at 01:58 UTC
    With regard to how you store your data in a standardised format I would suggest you consider CSV files.
    • Easy to manipulate with Perl:

    • Easy to use with JavaScript: many JavaScript libraries can read and manipulate CSV files. For example, D3.js can pull in a CSV file and generate an HTML table, or create interactive charts with transitions. I think you will find that library very useful.
    • Easy to import and export to/from spreadsheets.
    • Easy to back up: you can store them on a thumb drive or optical media.
    • Easy to 'debug' - when you encounter problems with your data you can open your file with a text editor and see exactly what data you have.
    • Easy to share via email or in the cloud.
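
    As a rough illustration of the Perl side, here is a sketch that reads a small time series from CSV text using only core Perl. It assumes simple data with a header line and no quoted fields or embedded commas; for real spreadsheet exports a CPAN module such as Text::CSV would be the safer choice. The function name parse_simple_csv and the column names are hypothetical.

```perl
use strict;
use warnings;

# Parse simple CSV text: the first line is the header. This does NOT
# handle quoted fields or embedded commas.
sub parse_simple_csv {
    my ($text) = @_;
    my @lines  = grep { $_ ne '' } split /\r?\n/, $text;
    my @header = split /,/, shift @lines;
    my @records;
    for my $line (@lines) {
        my %record;
        @record{@header} = split /,/, $line, -1;    # -1 keeps trailing empty fields
        push @records, \%record;
    }
    return \@records;
}

my $csv = "time_s,temperature_degC,pressure_kPa\n"
        . "0,21.5,101.3\n"
        . "60,21.7,101.2\n";
my $records = parse_simple_csv($csv);

# Example use: mean of one measured value over the series.
my $sum = 0;
$sum += $_->{temperature_degC} for @$records;
printf "mean temperature: %.1f degC\n", $sum / @$records;
```

    Putting the unit into the column name, as done here, is one simple way to keep value and unit together in a flat file.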
