PerlMonks  

Perl and Bioinformatics

by biohisham (Priest)
on Feb 15, 2010 at 12:52 UTC ( [id://823275]=perltutorial )

By BioLion * biohisham

BioPerl, the Perl interface to bioinformatics (biological data analysis using computers), is a collection of object-oriented modules for life-science data analysis. Tasks such as sequence manipulation and the parsing and processing of software-generated reports can be accomplished with its many modules.

These modules are powerful enough to minimize the need for lengthy code, and they are flexible, extensible and general enough to be reused across many domains. Here we shed light on some areas of bioinformatics where Perl can be used, and point to relevant resources that may benefit Monks. We also address Monks from biology/bioinformatics backgrounds who are new to the Monastery and need to ask effective Perl questions, to improve the interaction between them and Monks from other backgrounds.

Audience:

Perl Monks and everyone with an interest.

Introduction & Justification:

Perl has come to cover many areas of IT and has been dubbed the 'glue' language for that reason. Perl has also contributed to biology in a big way: it saved the Human Genome Project, and it has continued to be the mainstay of much bioinformatics munging and analysis, playing no small part in the burgeoning ‘*omics’ sciences.

The increasing number of bioinformatics-related Perl problems coming up in the Monastery, together with the confusing and disparate resources available on the internet, goes a long way towards making BioPerl intimidating or at least "perl-plexing"...

PerlMonks plays a great role in the evolution of Perl. It has encouraged many to join the community and exchange knowledge in a place of close cohesion between its members, so BioPerl coders can be equally encouraged to participate and share their knowledge and code.

The BioPerl suite of modules revolves around sequence acquisition, parsing and retrieval from public databases, and automating various tasks related to studying these sequences (see the BioPerl HOWTOs). Think this is simple? Think again - CODE.GOOGLE.COM tells us there are 3,666,478 lines of code to get your head round!

A sequence is just a text string in a certain format (the format is described at the beginning of the text file containing the sequence) that represents either a gene or a protein. The alphabet for gene (nucleotide) sequences is a combination of four letters (ACGT), sometimes with U (which replaces T in RNA) and N (for aNything). ('GATTACA', the title of a sci-fi movie, is spelled with these four letters.) The protein alphabet, on the other hand, comprises 20 letters.
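To make the alphabets above concrete, here is a minimal sketch in core Perl that guesses which alphabet a raw string uses. The helper name guess_alphabet is hypothetical (it is not a BioPerl function), and the check is deliberately naive: a short protein made up only of A, C, G and T would be reported as DNA.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only; not part of BioPerl.
# Returns 'dna', 'rna', 'protein', 'empty' or 'unknown'.
sub guess_alphabet {
    my ($seq) = @_;
    $seq = uc $seq;                       # treat lower and upper case alike
    return 'empty'   if $seq eq '';
    return 'dna'     if $seq =~ /^[ACGTN]+$/;
    return 'rna'     if $seq =~ /^[ACGUN]+$/;
    # The 20 standard amino acid one-letter codes.
    return 'protein' if $seq =~ /^[ACDEFGHIKLMNPQRSTVWY]+$/;
    return 'unknown';
}

for my $seq (qw(GATTACA gauuaca MKVLAT GATTACA!)) {
    print "$seq => ", guess_alphabet($seq), "\n";
}
```

The order of the tests matters: DNA is tried first, so anything that could be DNA is reported as DNA.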

Working with either type of sequences (DNA or protein) can involve:

  • Sequence comparison (sequence alignment): two or more sequences are compared against each other to evaluate how similar they are, and where they are similar.
  • Sequence manipulation: in-place modification, concatenation, reverse complementing, etc.
  • BLASTing: searching a database for similar sequences.
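As a small illustration of the sequence manipulation item, reverse complementing a DNA string needs nothing beyond core Perl. This is only a sketch: the sub name reverse_complement is made up for the example, and it handles only A, C, G and T (BioPerl sequence objects offer a revcom method for real work).

```perl
use strict;
use warnings;

# Illustrative helper; leaves any letter other than A, C, G, T unchanged.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = scalar reverse $dna;      # reverse the string, not a list
    $rc =~ tr/ACGTacgt/TGCAtgca/;      # swap each base for its complement
    return $rc;
}

my $seq = 'GATTACA';
print "original:           $seq\n";
print "reverse complement: ", reverse_complement($seq), "\n";   # TGTAATC

# Concatenation is ordinary string concatenation.
my $joined = $seq . reverse_complement($seq);
print "concatenated:       $joined\n";
```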

Description - what this is and isn't:

So, while this isn't intended to do the job of the extensive BioPerl docs or the many reference points out there, it will hopefully be a starting platform for those looking to delve deeper into using Perl for bioinformatics tasks, and will also help Monks become more receptive to BioPerl questions, facilitating the back and forth that makes Perl and the Monastery so special.

It is also meant to highlight the interesting problems that bioinformaticians have to deal with. Not all are BioPerl related(!), and they can often involve huge, diverse datasets. We hope that these sorts of challenges will tempt a few talented programmers to get involved.

Tips on posting bioinformatics type questions in the Monastery:

REMEMBER: a well-formulated question will garner a better and quicker response, so please go through the following whenever you notice that your question, or parts of it, doesn't look the way you expected after you hit the "preview" button.

These last two will help make sure your code is presentable, and their use should be considered Good coding practice.

How do I post a question effectively? is of particular relevance to specialist questions, such as bioinformatics ones, because it highlights the importance of well-formulated questions.

Examples of bad and good questions:

Not all monks are familiar with biology terms, and not all monks are into bioinformatics. So, as much as possible, use clear language to describe your problem and use biology terms only when relevant. Better still, post the part of your Perl code that shows the problem, or demonstrate the problem in Perl.

BAD QUESTIONS:

"I have a DNA sequence that I want to BLAST and I tried Bio::Tools::Run::StandAloneBlast but it did not work how can I do that?"

OR:

"I am trying to translate my coding sequence, I can work out the tRNA lookup table, but I can’t break up the sequence into codons - any ideas?"

These sorts of questions might not invite a quick response and would confuse the monks, so their response would be either to try to extract more words from you to get you to explain it better, to make wild guesses that would confuse you even more, or to ignore your question rather than BLASTing you. It is better to think about what you are actually trying to do and how it is a Perl problem.

This leads to an important, often overlooked point: provide test data (just enough - say 3 cases of input, not the whole file - and if it is in a particular format, say which or provide an example of its layout!) and, if you are really stuck, the output you want. This greatly helps people grasp what you are doing and also test any code they produce.

A GOOD QUESTION:

I am trying to convert a string (a DNA sequence) into a series of three-letter substrings. To do that I have written the following code, but I failed to make the substrings overlap by moving one letter at a time along the original sequence in the forward direction.

    #original seq: accgttac
    #required output:
    acc #first substring
    ccg
    cgt
    gtt
    tta
    tac #sixth substring

Here is my non-functioning code:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    for(<DATA>){
        print substr ($_,0,3),"\n" ;
    }
    __DATA__
    accgttac

Now that looks like an ideal question: clear wording, examples of input and desired output, and the code involved, so that respondents can test their code on the provided data.

Finally, always check whether your problem has been answered before - learn to love Super Search and Google… There are also links to discussions in the Monastery that may be of interest!
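Because the question is this clear, an answer is easy to write and to test. One possible reply, offered as a sketch rather than the only approach, moves the start offset of substr one letter at a time; the sub name overlapping_substrings is invented for the example.

```perl
use strict;
use warnings;

# Return every window of $width letters, advancing one letter at a time.
# Yields an empty list when the sequence is shorter than $width.
sub overlapping_substrings {
    my ($seq, $width) = @_;
    my @windows;
    for my $start (0 .. length($seq) - $width) {
        push @windows, substr($seq, $start, $width);
    }
    return @windows;
}

my $sequence = 'accgttac';    # the test data from the question
print "$_\n" for overlapping_substrings($sequence, 3);
```

The original attempt always called substr with offset 0, which is why it printed only the first substring.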

Good coding practice:

Many bioinformaticians are new to coding and can be guilty of certain malpractices, so make sure your code is readable, self-descriptive, and properly indented and commented. Good coding practices act as critical checkpoints: they alert you to potential errors and dangerous coding behavior, reduce debugging time, and increase code efficiency and re-usability. And as always, use warnings and strict, check for errors, etc., because you never know what this code could be used for! Maybe some IO error means that a potential cancer biomarker is missed (an extreme example, but the point remains!).

Also, remember that posted nodes can be edited later (if you are signed in as yourself and not posting as Anonymous Monk) to incorporate suggestions, changes to code, what course was finally decided, etc. It is considered good practice to mark any changes with ‘Update:’.
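A minimal sketch of these habits, assuming FASTA-style input where each record starts with a '>' line (the format and the sub name count_fasta_records are assumptions made for the example). An in-memory filehandle stands in for a real file so that the snippet runs anywhere; with a real file the open call would take a file name instead.

```perl
use strict;
use warnings;

# Count records in FASTA-style input: one record per line starting with '>'.
sub count_fasta_records {
    my ($fh) = @_;
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^>/;
    }
    return $count;
}

# Always check the result of open and close; $! explains what went wrong.
my $fasta = ">seq1\nACGT\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "Cannot open input: $!";
my $n = count_fasta_records($fh);
close $fh or die "Cannot close input: $!";

print "Found $n records\n";
```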

Tips on Answering BioPerl Questions:

This is still under development and requires contributions from our generous Monks.

Typical problems and solutions:

INSTALLATION:

A frequent problem is the installation of BioPerl. This in itself is not difficult if certain caveats are attended to. If you are familiar with Installing Modules then you are good to go.

Windows OS (ActiveState PPM Manager):

Note that the BioPerl Installation Requirements on Linux differ somewhat from the BioPerl Installation Requirements on Windows, and that not all of BioPerl is available on Windows. Hence you need to add the following repositories to the ActiveState PPM manager to be able to install the full package from different sources:

  • BioPerl-Release Candidates.
  • BioPerl-Regular Releases.
  • Kobes.
  • Bribes.
Adding these repositories is described in Installation Requirements on Windows. You might also want to check PPM Repository Management to enhance PPM efficiency after adding the above repositories.

Strawberry Perl:

Installing BioPerl in Strawberry Perl versions 5.8.* and 5.10.* is straightforward:

  1. Invoke the CPAN Client: Start -> Program Files -> Strawberry Perl -> CPAN Client
  2. From the CPAN Client interface type: CPAN> install BioPerl
  3. Select the default options.
Furthermore, it seems that as of January, 2010 the folks at Strawberry Perl are planning a Strawberry Perl Professional Distribution that comes with BioPerl bundled within the default installation which would eliminate the requirement for its manual installation.

Good Ol' CPAN:

Using CPAN to install BioPerl could be the easiest way for some experienced BioPerl programmers.
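Whichever installation route you take, it is worth checking afterwards that the modules actually load. The sketch below uses only core Perl; module_version is a hypothetical helper name, while Bio::Root::Version and Bio::SeqIO are modules shipped with BioPerl.

```perl
use strict;
use warnings;

# Return the module's version string if it loads, 'unknown' if it loads
# without a version, or undef if it cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # Bio::SeqIO -> Bio/SeqIO.pm
    return undef unless eval { require $file; 1 };
    my $version = eval { $module->VERSION };
    return defined $version ? $version : 'unknown';
}

for my $module (qw(Bio::Root::Version Bio::SeqIO)) {
    my $version = module_version($module);
    print defined $version
        ? "$module is installed (version $version)\n"
        : "$module is not installed\n";
}
```

If either line reports "not installed", the installation did not complete or Perl is looking in a different library path.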

IO problems often start with the sequence having non-canonical letters, punctuation, or whitespace left over from reading it in, so see perlretut and perlop for help on regexes and substitutions (s///), which are one way of checking for and replacing naughty characters.
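As a sketch of the s/// approach, the following cleans a raw DNA string and reports any letters that are still outside the expected alphabet. The sub name clean_dna and the choice of ACGTN as the canonical set are assumptions for the example; note that it also strips '-' characters, which you may want to keep if they represent alignment gaps.

```perl
use strict;
use warnings;

# Returns the cleaned, upper-cased sequence followed by a list of any
# remaining characters that are not A, C, G, T or N.
sub clean_dna {
    my ($raw) = @_;
    my $seq = uc $raw;
    $seq =~ s/\s+//g;                 # whitespace, including newlines
    $seq =~ s/[0-9[:punct:]]+//g;     # position numbers and punctuation
    my @bad = $seq =~ /([^ACGTN])/g;  # whatever is still non-canonical
    return ($seq, @bad);
}

my ($clean, @bad) = clean_dna("acgt nn-ac\n12 gtXa");
print "cleaned: $clean\n";
print "suspicious letters: @bad\n" if @bad;
```

Reporting the leftover letters rather than silently deleting them makes it easier to spot a file that is in the wrong format altogether.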

Modules of Interest-(Module Reviews Needed):

There are also a host of extremely useful modules for general data handling.

These last ones also highlight that bioinformatics is ultimately about ‘getting an answer’!

Got Data?:

So now you have a good start on the Perl side, but want some data to play with? Much of bioinformatics revolves around the integration of large datasets in an attempt to draw out relationships, ultimately giving biological meaning to observed phenomena.

Fortunately, biology naturally lends itself to informatics, with known hierarchies and inter-relations mirroring OO structuring, and the sheer abundance of data makes the challenge interesting. Here are a few possible sources of publicly available data:

  • Ensembl: data can be accessed using either CPAN tools or the Perl API.
  • http://www.biomart.org/: provides access to a vast number of other databases.
  • NCBI EUtilities: again accessible from Perl and a source of a huge amount of information. They also have a lot of documentation which explains what the data is...
  • Gene Ontology: a "structured, controlled vocabulary" accessed as "a relational database comprising the GO ontologies and the annotations of genes and gene products to terms in the GO." This sort of annotation is becoming a very popular way of approaching problems like "what commonalities link my group of highly expressed genes?". Code is already appearing on CPAN for accessing and querying GO data.
  • The Gene Expression Omnibus: again from the NCBI, this is a repository of actual genome-wide experimental data, fully annotated. Programmatic access is still fairly rudimentary, but once you have the data, the sky is the limit. Publication is dependent on making your data publicly available, so new datasets are continuously appearing.
  • European Molecular Biology Open Software Suite (EMBOSS): a stable, publicly available package that provides a cross-platform, user-friendly collection of hundreds of programs to perform tasks ranging from basic sequence alignment to publication presentation. A powerful asset for sure!
Thanks to erix for suggestions.
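To give a flavour of programmatic access, the NCBI EUtilities listed above are plain URL-based services, so a request can be assembled with core Perl alone. The sketch below only builds an efetch URL and does not contact NCBI; the sub name efetch_url is hypothetical, the accession is just an example, and no URL-escaping is done, so it assumes simple identifiers.

```perl
use strict;
use warnings;

# Build an efetch request URL. Defaults (nucleotide database, FASTA text
# output) are choices made for this example.
sub efetch_url {
    my (%arg) = @_;
    my $base  = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi';
    my %param = (
        db      => $arg{db}      || 'nucleotide',
        id      => join(',', @{ $arg{ids} }),
        rettype => $arg{rettype} || 'fasta',
        retmode => 'text',
    );
    my $query = join '&', map("$_=$param{$_}", sort keys %param);
    return "$base?$query";
}

print efetch_url(ids => ['NM_000546']), "\n";
```

The resulting URL can then be fetched with any HTTP client, for example LWP::UserAgent. Check NCBI's usage guidelines before sending requests in bulk.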

Currently available resources:

For both biologists and programmers, here are a few resources for those who want to read more.

Any other recommendations for free and open source resources are welcome!

Nodes of interest:

Many great discussions have taken place in the monastery and this is just to highlight a few of the lessons learned there. Super Search will hopefully lead you to more specific answers too!

Jobs:

Lastly, if you are really interested, there are several good forums / sites that advertise jobs within bioinformatics and related science. Personally, I have found job-hunting to be no easy task, so here are a few of the better things I have stumbled upon:

Further Insight:

If you can suggest ideas, invite or offer module reviews, or share code addressing a certain aspect of BioPerl, feel free to come forward.

If you intend to develop libraries in BioPerl, a grip on Object Oriented Programming is mandatory.

Acknowledgment: Thanks to planetscape, erix, marto and GrandFather for their contributions.

Originally published as RFC: Bioinformatics Tutorial on Feb 07, 2010 by BioLion

Replies are listed 'Best First'.
Re: Perl and Bioinformatics
by MadraghRua (Vicar) on Feb 16, 2010 at 20:31 UTC
    Guys

    I enjoyed the node. Here is a suggested topic that would be worth pursuing for the biological crowd

  • Data Structures. So BioPerl gives you some pretty cool data structures that are easy to handle. It's when you run into custom structures that you get problems. For instance, if I'm working with E. coli, I have ~5e6 bp of DNA - 1e7 bp if I'm working on each nucleotide on both strands. How do I manage an analysis that needs to annotate every base, e.g. working with coverage from next gen analysis? Using arrays or hashes gets ugly because you will typically run out of memory. I'm not aware of an out of the box BioPerl solution, though I could stand to be corrected. You could use pack and unpack. You could use DB_File. You might even go to Berkeley DB. But the problem is general enough that it would be useful to see one or more tutorials on what to do for these larger analysis problems that are beyond simple scripts and not necessarily part of the BioPerl toolbox.

    MadraghRua
    yet another biologist hacking perl....
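As a rough illustration of the compact-storage idea raised above, here is a sketch that keeps per-base coverage in a single packed string using Perl's built-in vec, a close relative of the pack/unpack approach. The 16 bits per base, the genome length and the sub name add_read are all assumptions for the example; at this size the string takes about 10 MB, far less than an array of five million Perl scalars.

```perl
use strict;
use warnings;

my $genome_length = 5_000_000;    # roughly E. coli sized, as in the post
my $coverage      = '';

# Touching the last element extends the string to full length, zero filled.
vec($coverage, $genome_length - 1, 16) = 0;

# Add one aligned read: increment the depth at every base it covers.
# Depth saturates at 65_535, the largest value 16 bits can hold.
sub add_read {
    my ($cov_ref, $start, $length) = @_;
    for my $pos ($start .. $start + $length - 1) {
        my $depth = vec($$cov_ref, $pos, 16);
        vec($$cov_ref, $pos, 16) = $depth + 1 if $depth < 65_535;
    }
    return;
}

add_read(\$coverage, 100, 50);    # covers positions 100..149
add_read(\$coverage, 120, 50);    # covers positions 120..169

printf "bytes used: %d\n", length $coverage;
printf "depth at 125: %d\n", vec($coverage, 125, 16);
```

Fewer bits per base (8, or even 4) shrink the string further at the cost of a lower maximum depth.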

      I think this is a really good suggestion - certainly a topic that is becoming more and more relevant. We tried to touch on this in the text, not to knock BioPerl, but their objects are generally huge (even for simple things) mainly because of the need to ensure that they all mesh well together, and ensuring backwards compatibility, amongst other things. As biohisham says, this interoperability of the whole suite is its greatest strength, but certainly can be a weakness too.

      I have generally taken to rolling my own stripped down objects, and using caching when things get really hairy.

      I asked a question on this sort of topic before ( Storable Objects ), and for that problem I did end up setting 'store-points' where I would cache the appropriate info at certain critical points. This worked, but certainly isn't applicable to all cases, especially ones like you mention where the processing isn't so linear.

      If you have an example problem (and solutions you tried), please post it here, it would be good to get discussion going - as I said, I think this is a very relevant problem.

      Just a something something...
Re: Perl and Bioinformatics
by Fortinbras (Novice) on Mar 26, 2010 at 02:35 UTC
    Wow -- excellent overview!

    Re: answering BioPerl questions, the BP list bioperl-l@lists.open-bio.org is a great place to see good examples by BP gurus. Very generally, we

    1) prod politely for code/data if necessary;

    2) give short answers to FAQs using links to an appropriate HOWTO or Scrapbook example;

    3) give longer answers if the questions are deep, usually with code examples;

    4) start long developer discussions if the questions are interesting;

    5) admit guilt if the problem is actually a bug, then fix it;

    6) generally assume the supplicant is completely ignorant, but infinitely intelligent.

    I highly encourage monks to check out the wiki for easily distributed packets of sageliness.

    cheers FB
