PerlMonks  

A generic biomedical data processing library

by spiros (Beadle)
on Mar 21, 2010 at 13:26 UTC ( #829939=perlquestion )

spiros has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I want to build a generic Perl module for handling and analysing biomedical character-separated datasets, which can most certainly be used on any kind of dataset that contains a mixture of categorical (A,B,C,..), continuous (1.2,3,881..) and identifier (XXX1,XXX2...) variables. The plan is to have people initialize the module and then use some arguments to point to the data file(s), the place where the analysis reports should be placed and the structure of the data.

By structure of data I mean which variable is in which place and its name/type. And this is where I need some enlightenment. I am baffled how to do this in a clean way. Obviously, having people create a simple schema file, be it XML or some other format would be the cleanest but maybe not all people enjoy doing something like this.

The solutions I can think of are:
  • Create a configuration file in XML or similar and with a prespecified format.
  • Pass the information during initialization of the module
  • Use the first row of the data as headers and try to guess types (ouch)

Surely there must be a "canonical" way of doing this that is also usable and efficient. It should also reflect the fact that these datasets change over time, with new variables added or old ones deleted, and that people might have a subset of the data at any time and not the entire thing.

    Thank you!
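To make the third option above concrete, here is a minimal sketch of type guessing from column values. The function name `guess_type` and its rules (all numeric means continuous, all distinct means identifier, otherwise categorical) are assumptions made for illustration, not part of any existing module.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a column type from its observed values.
# The three categories follow the question; the rules are assumptions.
sub guess_type {
    my (@values) = @_;

    # Ignore missing (undefined or empty) values.
    my @seen = grep { defined($_) && length($_) } @values;
    return 'unknown' unless @seen;

    # All values look like numbers: call the column continuous.
    my $numeric = qr/^[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?$/;
    return 'continuous' unless grep { $_ !~ $numeric } @seen;

    # Every value distinct: treat as identifier; otherwise categorical.
    my %distinct;
    $distinct{$_}++ for @seen;
    return keys(%distinct) == @seen ? 'identifier' : 'categorical';
}

print guess_type(qw(1.2 3 881)), "\n";
print guess_type(qw(A B A C)), "\n";
print guess_type(qw(XXX1 XXX2 XXX3)), "\n";
```

The "ouch" shows up quickly: a categorical variable coded as 1, 2, 3 would be reported as continuous, and a small sample of distinct categories would look like an identifier, so guessed types probably need a way to be overridden.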

Replies are listed 'Best First'.
Re: A generic biomedical data processing library
by almut (Canon) on Mar 21, 2010 at 13:46 UTC
    ... By structure of data I mean which variable is in which place and its name/type.

    Maybe specify a template similar in spirit to Perl's pack, e.g. something like

        "a,a,c,i,a,c,..."

        # to be used in:
        my ($foo, $bar, $baz, ...) = unpack_line("a,a,c,i,a,c,...", $line);

        # or
        my $parser = My::Module->new( template => "a,a,c,i,a,c,..." );
        my ($foo, $bar, $baz, ...) = $parser->read_line();

    (where a / c / i would stand for categorical / continuous / identifier, etc.)

    Or together with names:

        "foo:a, bar:c, baz:i, ..."

        my %fields = unpack_line("foo:a, bar:c, baz:i, ...", $line);

    Of course, there are many other ways to approach this, so possibly ask your potential users what interface or mini template language they would prefer...
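One way the named template could be implemented is sketched below. `unpack_line` is the name used in the reply, but its body, the helper `parse_template`, the optional separator argument and the numeric check are all assumptions for illustration. The split is naive and does not handle quoted fields.

```perl
use strict;
use warnings;

# Type codes as described in the reply: a / c / i.
my %TYPE_NAME = ( a => 'categorical', c => 'continuous', i => 'identifier' );

# Hypothetical helper: turn "name:type, name:type" into an ordered
# list of [ name, type ] pairs, rejecting unknown type codes.
sub parse_template {
    my ($template) = @_;
    my @spec;
    for my $item ( split /\s*,\s*/, $template ) {
        my ( $name, $type ) = $item =~ /^\s*(\w+):(\w)\s*$/
            or die "Bad template item '$item'\n";
        exists $TYPE_NAME{$type} or die "Unknown type code '$type'\n";
        push @spec, [ $name, $type ];
    }
    return @spec;
}

# Split one data line on a separator and return name => value pairs.
# Fields declared continuous must look like numbers.
sub unpack_line {
    my ( $template, $line, $sep ) = @_;
    $sep = ',' unless defined $sep;
    my @spec = parse_template($template);
    chomp $line;
    my @values = split /\Q$sep\E/, $line, -1;
    @values == @spec
        or die "Expected " . @spec . " fields, got " . @values . "\n";

    my $numeric = qr/^\s*[-+]?(?:\d+\.?\d*|\.\d+)(?:[eE][-+]?\d+)?\s*$/;
    my %fields;
    for my $idx ( 0 .. $#spec ) {
        my ( $name, $type ) = @{ $spec[$idx] };
        my $value = $values[$idx];
        die "Field '$name' is not numeric: '$value'\n"
            if $type eq 'c' && $value !~ $numeric;
        $fields{$name} = $value;
    }
    return %fields;
}

my %fields = unpack_line( "foo:a, bar:c, baz:i", "A,1.2,XXX1" );
print "$_ = $fields{$_}\n" for sort keys %fields;
```

Re-parsing the template on every line keeps the sketch short; an object that parses it once, as in the `My::Module->new( template => ... )` variant, would avoid that cost on large files.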

Re: A generic biomedical data processing library
by roboticus (Chancellor) on Mar 21, 2010 at 13:54 UTC

    spiros:

    Too many people have differing views on how such a standard should be built, which is why we have so many different file types. In many cases new file types are justified. In others they're simply different with no real benefit. While there are some interesting challenges in designing a (good) new file type to replace a collection of others, the big challenge will be getting enough buy-in from everyone in your problem domain to actually start using it. Without such buy-in, instead of having n different file formats, you'll simply end up with n+1 different file formats.

    ...roboticus

Re: A generic biomedical data processing library
by BrowserUk (Pope) on Mar 21, 2010 at 13:50 UTC
    Use the first row of the data as headers and try to guess types (ouch)

    Why not add a second header line that identifies the types of the fields labelled in the first line?

    Or, add prefixes or suffixes to the headers: xyz:int, pqr:string, abc:float. And for file formats that do not incorporate a header line, have the user pass in that line as an initialisation parameter.
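A minimal sketch of reading such a suffixed header follows. The function name `parse_typed_header`, the `:` delimiter between name and type, and the default of `string` for unlabelled fields are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: parse a header whose labels carry a type
# suffix, e.g. "xyz:int,pqr:string,abc:float". Returns the field names
# in order plus a name => type lookup.
sub parse_typed_header {
    my ( $header, $sep ) = @_;
    $sep = ',' unless defined $sep;
    chomp $header;

    my ( @names, %type_of );
    for my $label ( split /\Q$sep\E/, $header, -1 ) {
        my ( $name, $type ) = split /:/, $label, 2;
        for ( $name, $type ) {
            $_ = '' unless defined;    # missing part becomes empty
            s/^\s+|\s+$//g;            # trim surrounding whitespace
        }
        $type = 'string' if $type eq '';    # assumed default type
        push @names, $name;
        $type_of{$name} = $type;
    }
    return ( \@names, \%type_of );
}

my ( $names, $type_of ) = parse_typed_header("xyz:int, pqr:string, abc:float");
print "$_ is $type_of->{$_}\n" for @$names;
```

For files without a header line, the caller could pass the same string as an initialisation parameter and feed it through the same function.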


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Why not add a second header line that identifies the types of the fields labelled in the first line?
      I rarely find it a good idea to change anything in external data files and, for sure, allowing the users to change a data-file is courting disaster.

      Before you know it, some data is inadvertently changed and a wrong diagnosis follows.

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

        allowing the users to change a data-file is courting disaster

        And who are "the users" in this case? "Disaster" is a little dramatic don't you think?


      That is a good idea; the only problem I can think of is that the data tends to be multidimensional, so doing this for 150 variables might be tedious. I would also like to have it in a simple, readable form - I am leaning towards YAML for this.
        I am leaning towards YAML for this.

        Just a general note.  During my last job (in a lab mostly frequented by psychologists, linguists, etc.) I wrote a suite of modules for EEG/ERP analysis, and my general conclusion was that if you make the user interface too complex (as is unfortunately sometimes required for generic solutions), people simply aren't going to use the tool — in particular, if the entry threshold is high, and it doesn't come with lots of ready to use cut-n-paste examples.

        In other words, using YAML would be fine if they know it already, but otherwise they might not be willing to learn it  (which could mean - at least if you work in the same lab - you'll always be the one eventually writing the code for them :)

        Dealing with 150 fields is always going to be tedious. Whether you spread it horizontally over a single line, or vertically over 150 lines.

        The nice thing about the prefix/suffix idea is that it can be embedded in a standard Xsv file and normal Xsv handling can still be used. If the processor uses the header line, the fields just carry some extra information. If it discards the headers, the extra information just gets discarded with them. If it uses the field names for processing, you only need to preprocess the first line to strip the suffixes to allow it to still work.

        Things that a YAML/XML/Other format description would never allow.
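The preprocessing step mentioned above could be as small as the following sketch. The function name `strip_type_suffixes` and the `:` delimiter are assumptions for illustration; field names that themselves contain the separator or a colon are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: remove ":type" suffixes from a header line so
# that code expecting plain field names keeps working unchanged.
sub strip_type_suffixes {
    my ( $header, $sep ) = @_;
    $sep = ',' unless defined $sep;
    chomp $header;
    my @labels = split /\Q$sep\E/, $header, -1;
    s/\s*:[^:]*$// for @labels;    # drop everything from the last colon
    return join $sep, @labels;
}

print strip_type_suffixes("xyz:int,pqr:string,abc:float"), "\n";
```

The stripped line can then be handed to whatever header handling the existing Xsv processor already has.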


Re: A generic biomedical data processing library
by TGI (Parson) on Mar 22, 2010 at 21:00 UTC
      And your point is? I posted it there first and the only reply I had was "check CPAN for modules". I usually ask things on SO to get more generic answers, and then sometimes I tend to ask in more language-oriented places. I apologise if this caused you distress?

        No criticism is intended. I'm just tying the related discussions together, so that others have the potential benefit of seeing both.

        I don't think cross posting is wrong. In fact, I cross-post things from time to time. I label the posts as cross posts, so that people can easily see the related discussion. Which is exactly the same reason I labeled your post.

        PM and SO are very different places, and it is often helpful to get opinions in both places.

        I was rather brief in my post, and I can see how that could be interpreted as a dismissive gesture. It was not intended as such. I apologize for any offense given.


        TGI says moo
