scratchpad
BioLion
<p><b>RFC (Perl and Bioinformatics) [BioLion] * [biohisham]</b></p>
<p><b>Audience:</b> Perl Monks and Everyone with an interest. </p>
<p>
<b>Introduction & Justification:</b>
Perl has come to cover many areas of IT, earning it the nickname 'glue'. Perl has also contributed to biology, big time: it [http://www.bioperl.org/wiki/How_Perl_saved_human_genome|saved the human genome project], and it has continued to be the mainstay of much bioinformatics [http://sysbio.harvard.edu/csb/resources/computational/scriptome/UNIX/|munging] and analysis, playing no small part in the burgeoning ‘*omics’ sciences.</p>
<p>
The increasing number of bioinformatics-related Perl problems coming up in the Monastery, together with the confusing and disparate resources available on the internet, goes a long way towards making BioPerl intimidating, or at least "perl-plexing"…
</p>
<p>
PerlMonks plays a great role in the evolution of Perl. It has encouraged many to join the community and exchange knowledge in a place of great cohesion between its members, so BioPerl coders can equally be encouraged to participate and share their knowledge and code.
</p>
<p>
<b>Description - what this is and isn't:</b>
</p><p>So, while this isn't intended to do the job of the extensive BioPerl docs, or of the many reference points out there, it will hopefully be a starting platform for those looking to delve deeper into using Perl for bioinformatics-related tasks, and will also help Monks become more receptive to BioPerl questions: <i>facilitating the back and forth that makes Perl and the Monastery so special</i>.
</p>
<p>It also aims to highlight the interesting problems that bioinformaticians have to deal with. Not all are BioPerl related(!), and they often involve huge, diverse datasets. We hope that these sorts of challenges will tempt a few talented programmers to get involved.</p>
<p>
<b>Tips on posting bioinformatics type questions in the Monastery:</b>
</p>
<p>
Please go through the following whenever your question, or part of it, doesn't look the way you expected after you hit the "preview" button, and <b><i>remember:</i> a well-formulated question will garner a better and quicker response.</b>
<ul>
<li>[How do I post a question effectively?] </li>
<li>[Markup in the Monastery]
</li>
<li>[How do I compose an effective node title?]</li>
<li>[cpan://Perl::Tidy] and [cpan://Perl::Critic]. These two will help make sure your code is presentable, and their use should be considered <b>Good Coding Practice</b> (see below).</li>
</ul>
<p>[How do I post a question effectively?] is particularly relevant for specialist questions, such as those about bioinformatics. Here we try to highlight <b>the importance of well-formulated questions</b>: </p>
<p>Not all monks are familiar with biology terms, and not all monks are into bioinformatics. As much as possible, use clear language to describe your problem and use biology terms only when relevant. Better still, post the part of your Perl code that shows the problem, or demonstrate the problem in Perl.</p>
<p>
<blockquote>"<i>I have a DNA sequence that I want to BLAST and I tried Bio::Tools::Run::StandAloneBlast but it did not work how can I do that? </i>"
<br><b><i>
OR
</i></b><br>
“<i>I am trying to translate my coding sequence, I can work out the tRNA lookup table, but I can’t break up the sequence into codons - any ideas?</i>”</blockquote>
</p>
<p>
These sorts of questions invite down-voting and confuse monks. Responders will either try to coax a better explanation out of you, make wild guesses that confuse you even more, or ignore your question altogether rather than BLASTing you. It is better to think about what you are actually trying to do, and about how this is a <i><b>Perl</b></i> problem.
</p>
<p>
This leads to an important point, often overlooked: provide test data (just enough, say 3 cases of input, not the whole file; if it is in a particular format, say which or give an example of its layout!) and, if you are really stuck, the output you want. This greatly helps people grasp what you are doing and also lets them test any code they produce.
</p>
<p>
<blockquote><i>I am trying to convert a string (a DNA sequence) into a series of three-letter substrings. I have written the following code, but I failed to make the substrings overlap by moving one letter at a time along the original sequence in the forward direction.</i>
<br>
<c>
#original seq
accgttac
#required output
acc #first substring
ccg
cgt
gtt
tta
tac #sixth substring
</c>
<i>Here is my non-functioning code</i>
<c>
#!/usr/local/bin/perl
use strict;
use warnings;
for(<DATA>){
print substr ($_,0,3),"\n" ;
}
__DATA__
accgttac
</c>
</blockquote></p>
<p>Now that looks like an ideal question: clear wording, examples of the input and the desired output, and the code involved, so that respondents can test their code on the data provided.</p>
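<p>A question posed this clearly is also easy to answer. As a minimal sketch (one possible reply, not the only one), the overlapping substrings can be collected by sliding a window along the string with [perldoc://substr]. The sub name <c>kmers</c> and its window-length parameter are made up for this illustration:</p>

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return every overlapping window of length $k from $seq,
# moving one letter at a time from left to right.
sub kmers {
    my ($seq, $k) = @_;
    my @kmers;
    for my $i (0 .. length($seq) - $k) {
        push @kmers, substr($seq, $i, $k);
    }
    return @kmers;
}

# The test data from the question above.
print "$_\n" for kmers('accgttac', 3);
```

<p>For the eight-letter test sequence this prints six substrings, from <c>acc</c> to <c>tac</c>. A sequence shorter than the window simply yields an empty list.</p>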
<p>Finally, always check to see if your problem hasn’t been answered before - learn to love [Super Search] and [http://www.google.co.uk/search?hl=en&q=site%3Aperlmonks.org&btnG=Search&meta=&aq=f&oq=|google]… There are also links to discussions in the Monastery that may be of interest!</p>
<p>
<b>Good coding practice:</b>
</p>
<p>
Many bioinformaticians are new to coding and can be guilty of certain malpractices, so make sure your code is readable, self-descriptive, properly indented and commented. Good coding practices act as checkpoints: they alert you to potential errors and dangerous coding behaviour, reduce debugging time, and increase code efficiency and re-usability. And as always, [perldoc://use] [perldoc://warnings] and [perldoc://strict], check for errors etc… because you never know what this code could be used for! Maybe some IO error means that a potential cancer biomarker is missed (an extreme example, but the point remains!).
</p>
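<p>To make the "check for errors" advice concrete, here is a small sketch that counts the records in a FASTA-style file and refuses to carry on silently if the file cannot be read. The sub name <c>count_fasta_records</c> and the file name are hypothetical, and it assumes that every record starts with a <c>&gt;</c> header line:</p>

```perl
use strict;
use warnings;

# Count the records (lines starting with '>') in a FASTA-style file.
# Dies with the system error message if the file cannot be opened.
sub count_fasta_records {
    my ($path) = @_;
    open my $fh, '<', $path
        or die "Cannot open '$path' for reading: $!\n";
    my $count = 0;
    while (my $line = <$fh>) {
        $count++ if $line =~ /^>/;
    }
    close $fh or die "Cannot close '$path': $!\n";
    return $count;
}

# Hypothetical file name; eval traps the error so it can be reported.
my $n = eval { count_fasta_records('no_such_file.fasta') };
print defined $n ? "$n records\n" : "Problem: $@";
```

<p>The point is less the counting than the habit: every <c>open</c> and <c>close</c> is checked, so a missing input file produces a clear message instead of an empty (and misleading) result.</p>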
<p>Also - remember that posted nodes can be edited at a later point to encompass suggestions, changes to code, what course was finally decided etc... Remember that it is considered good form to mark any changes with ‘<b>Update:</b>’. </p>
<p>
<b>Tips on Answering BioPerl Questions:</b>
</p>
<p><b>Typical problems and solutions:</b></p>
<br>
<i><b>INSTALLATION</b></i>:
<p>
A frequent problem is the installation of BioPerl. This in itself is not difficult if certain caveats are attended to: if you are familiar with [Installing Modules] then you are good to go. Note that the BioPerl [http://bioperl.open-bio.org/SRC/bioperl-live/INSTALL|Installation Requirements on Linux] and the BioPerl [http://bioperl.open-bio.org/SRC/bioperl-live/INSTALL.WIN|Installation Requirements on Windows] differ, and that not all of BioPerl is available on Windows, so you need to add the following repositories to the ActiveState PPM manager:
<ul>
<li>BioPerl-Release Candidates.</li>
<li>BioPerl-Regular Releases.</li>
<li>Kobes.</li>
<li>Bribes.</li>
</ul>
Adding these repositories is described in [http://bioperl.open-bio.org/SRC/bioperl-live/INSTALL.WIN|Installation Requirements on Windows]. If you are on Windows, you might also want to check [PPM performs uneeded checks|PPM Repository Management] to improve PPM efficiency after adding the above repositories.
</p>
<p><i>If anyone can contribute tips for other methods (e.g. Strawberry Perl and cpan?), it would be much appreciated!</i></p>
<p>
The BioPerl suite of modules revolves around sequence acquisition, parsing and retrieval from public databases, and around automating various tasks related to studying these sequences (see the [http://www.bioperl.org/wiki/HOWTOs|BioPerl HOWTOs]). Think this is simple? Think again - <b>[http://code.google.com/p/bioperl/|CODE.GOOGLE.COM] tells us there are 3,666,478 lines of code to get your head round!</b>
</p>
<p>
A sequence is just a text string in a certain format (the format is described at the beginning of the text file containing the sequence) that represents either a gene or a protein. The alphabet for genes (nucleotide sequences) is a combination of just four letters (ACGT), with U sometimes replacing T (in RNA) and N standing for aNything. ('<i>GATTACA</i>' is a sci-fi movie name made up of these four letters.) The protein alphabet, on the other hand, comprises 20 letters.
</p>
<p>IO problems often start with non-canonical letters, punctuation, or whitespace left in the sequence when it was read in, so see [perldoc://perlretut] and [http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators|perlop] for help on regexes and substitutions (<c>s///</c>), which are one way of checking for or replacing naughty characters.</p>
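<p>As a rough illustration of that kind of clean-up, the sketch below strips whitespace with <c>s///</c> and then lists anything that does not belong. The sub name <c>clean_dna</c> is invented here, and the accepted alphabet (A, C, G, T and N) is an assumption you would adjust for RNA or protein data:</p>

```perl
use strict;
use warnings;

# Tidy a raw DNA string: upper-case it, drop all whitespace, and
# report any characters outside the assumed alphabet (A, C, G, T, N).
sub clean_dna {
    my ($raw) = @_;
    (my $seq = uc $raw) =~ s/\s+//g;
    my @bad = $seq =~ /([^ACGTN])/g;
    return ($seq, @bad);
}

my ($seq, @bad) = clean_dna("acc gtt\nac-x\n");
print "sequence: $seq\n";
print "unexpected characters: @bad\n" if @bad;
```

<p>Here the stray <c>-</c> and <c>x</c> are reported rather than silently deleted, which is usually the safer default: whether to drop, replace or reject such characters depends on where the data came from.</p>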
<p>
Working with either type of sequences (DNA or protein) can involve:
<ul>
<li><i>sequence comparison (Sequence Alignment):</i> two or more sequences are compared against each other to evaluate how similar they are, and where they are similar.</li>
<li><i>Sequence manipulation:</i> in-place modification, concatenation, [perldoc://reverse] [ 197793|complementing], etc…</li>
<li><i>BLASTing (Database Search for similar sequences):</i> </li> </ul>
</p>
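<p>Reverse complementing is a good example of sequence manipulation that needs nothing beyond core Perl. A minimal sketch, assuming plain DNA with only the letters A, C, G and T in either case (the sub name <c>reverse_complement</c> is just for illustration):</p>

```perl
use strict;
use warnings;

# Reverse a DNA string and swap each base for its pairing partner.
sub reverse_complement {
    my ($dna) = @_;
    my $rc = reverse $dna;          # scalar context: reverses the string
    $rc =~ tr/ACGTacgt/TGCAtgca/;   # A<->T and C<->G, case preserved
    return $rc;
}

print reverse_complement('accgttac'), "\n";   # prints gtaacggt
```

<p>Any character not listed in the <c>tr///</c> (an N, say) is left untouched, which may or may not be what you want.</p>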
<p><b>Modules of Interest: (Module Reviews Needed)</b>
<ul>
<li> [http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Seq.pm|Bio::Seq].</li>
<li>[http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/SeqIO.pm|Bio::SeqIO] to access sequence files and perform I/O operations</li>
<li>[http://search.cpan.org/search?query=Bio::DB&mode=all|Bio::DB] and [http://search.cpan.org/search?query=Bio::DB::Query&mode=all|Bio::DB::Query] to either retrieve a single sequence from a database via its ID or ACCESSION Number or retrieve multiple sequences at a time by query objects containing search terms and criteria specific to the database under investigation.</li>
<li>[http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/Tools/Run/StandAloneBlast.pm|Bio::Tools::Run::StandAloneBlast] to run the [http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download|local BLAST version] of the famous sequence analysis program.</li>
</p><p>There are also a host of extremely useful modules for general data handling.
<li>[http://search.cpan.org/search?mode=module&query=Math::Random|Math::Random] Often used as a control.</li>
<li>[http://search.cpan.org/search?mode=module&query=Data::Validate|Data::Validate] Data munging tasks.</li>
<li>Other [http://search.cpan.org/search?mode=module&query=Statistics|Statistical] modules and [http://search.cpan.org/search?mode=module&query=Statistics::R|R] Modules.</li>
These last ones also highlight that bioinformatics is ultimately about ‘getting an answer’!
<li>[cpan://Benchmark], [cpan://Parallel::ForkManager], [cpan://Parallel::Forker] and [cpan://Devel::NYTProf].</li>
For those interested in getting the answer slightly faster!
</ul>
</p>
<p>
<li>Publicly available [http://www.bioperl.org/wiki/Bioperl_scripts|scripts].</li> </p> <p><b>Further Insight:</b> If you intend to develop libraries in BioPerl, a grip on [http://perlmonks.org/?node=Tutorials#Object-Oriented-Programming|Object Oriented Programming is mandatory]. </p>
<p>
<b>Got Data?</b>
</p><p>
So now you have a good start on the Perl side, but want some data to play with? Much of bioinformatics revolves around the integration of large datasets in an attempt to draw out relationships, ultimately giving biological meaning to observed phenomena.
</p>
<p>
Fortunately, biology naturally lends itself to informatics, with known hierarchies and interrelations mirroring OO structuring, and the sheer abundance of data makes the challenge interesting. Here are a few possible sources of publicly available data:
<ul>
<li>[http://www.ensembl.org|ENSembl]</li> Data can be accessed using either [cpan://Bio::Tools::Run::Ensembl|cpan tools] or the [http://www.ensembl.org/info/data/api.html|perl API]
<li>[http://www.biomart.org/]</li>
Providing access to a vast number of other databases.
<li>[http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html|NCBI EUtilities]</li>
Again Perl based and a source of a huge amount of information. They also have a lot of documentation explaining what the data <i>is</i>...
</ul>
Thanks to [erix] for suggestions.
</p>
<p>
<b>Currently available resources:</b>
</p><p>For both biologists and programmers, here are a few resources for those of you who want to read more.
<ul>
<li>[http://www.bioperl.org/wiki/Main_Page|BioPerl Official Site] and [http://www.bioperl.org/wiki/HOWTOs|HOWTOs].
</li>
This is obviously a major hub of information, and with the BioPerl API, a huge source of power. From the outside though, it can be intimidating.
<li>
[http://www.perl.com/pub/a/2002/01/02/bioinf.html|Beginning Perl for Bioinformatics] by James Tisdall. (Review Required).
</li>
<li>
[http://www.amazon.co.uk/Mastering-Perl-Bioinformatics-James-Tisdall/dp/0596003072/ref=pd_bxgy_b_img_b|Mastering Perl for Bioinformatics] by James Tisdall</li>
<li>[http://isbn.nu/0131008250| Bioinformatics Computing] by Bryan Bergeron </li>
<li>[http://isbn.nu/9781565926646| Developing Bioinformatics Computer Skills] by Cynthia Gibas and Per Jambeck
</li>
<li>[http://isbn.nu/9780596002992| BLAST] by Ian Korf, Mark Yandell and Joseph Bedell </li>
<li>[http://isbn.nu/9780321173867| Bioinformatics in the Post-Genomic Era: Genome, Transcriptome, Proteome, and Information-Based Medicine] by Jeff Augen </li>
All of which (and possibly more) are available on [http://my.safaribooksonline.com/|O'Reilly's Safari] ( thanks to [planetscape] ).
<li>[http://www.osc.edu/supercomputing/training/bioperl/perl_bioinf_0411_pdf.pdf|Using Perl for Bioinformatics (PDF)] (Review Required).
</li>
<li>
[http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.1000589|A Quick Guide for Developing Effective Bioinformatics Programming Skills].
</li>
<li>[http://www.perl.com/pub/a/2002/01/02/bioinf.html|James Tisdall on Perl.com]</li>
<li>[http://www.osc.edu/supercomputing/training/bioperl/perl_bioinf_0411_pdf.pdf|OSC tutorial]</li>
<li>[http://www.perl.com/pub/a/2005/10/20/scriptome.html?page=1|Data Munging on Perl.com]</li>
<li><i>[http://www.bioinfobooks.blogspot.com] - others, with reviews</i></li>
<li><i>Courses too…</i>[http://meetings.cshl.edu/courses/c-info10.shtml|Cold Spring Harbor's PFB]</li>
[BioLion]: this was my first major experience of bioinformatics, beyond university/college short courses, and was superb. The focus is very problem-oriented, and has a heavy emphasis on teaching you to teach yourself, which in the long run is the most important lesson.</p><p><i>Any other recommendations are welcome!</i>
</ul>
</p>
<p>
<b>Nodes of interest</b>
</p><p>Many great discussions have taken place in the monastery and this is just to highlight a few of the lessons learned there. [Super Search] will hopefully lead you to more specific answers too!
<ul>
<li>[BioPerl].</li>
<li>[Job Field - Bioinformatist].</li>
<li>[perl's long term place in bioinformatics?].</li>
</ul>
</p>
<p>
<b>Jobs</b>
</p><p>
Lastly, if you are really interested, there are several good forums / sites that advertise jobs within bioinformatics and related sciences. Personally, [BioLion|I] have found job-hunting to be no easy task, so here are a few of the better things I have stumbled upon:
<ul>
<li>[http://www.bioinformatics.fr/jobs.php]</li>
<li>[http://www.newscientistjobs.com/jobs/browse/biology_bioinformatics.htm] - also naturejobs, sciencejobs etc...</li>
<li>[http://www.123genomics.com/jobs.html]</li>
</ul>
</p>
<p>
<b>Further Insight:</b> If you can suggest ideas, offer or request module reviews, or share code addressing a certain aspect of BioPerl, feel free to come forward.
</p>
<p><i>Thanks to [planetscape], [erix], [marto] and [GrandFather] for their contributions </i></p>
<b>END</b>
</p>
<p><b>Meditation / Tutorial I am thinking about RFCing:</b></p>
<p>Sharing variables and filehandles between processes</p>
<p>Resources:
<ul>
<li>[id://637089]</li>
<li>[id://7058]</li>
<li>[id://722663]</li>
<li>[id://678974]</li>
<li>[http://perldoc.perl.org/perlthrtut.html]</li>
<li>[id://648665]</li>
<li>[id://672403]</li>
</ul>
Some relevant modules :
<ul>
<li>[cpan://subs::parallel]</li>
<li>[cpan://Parallel::ForkManager]</li>
<li>[cpan://Parallel::Forker] - New!</li>
</ul></p>
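<p>As a starting point for the example code, here is a rough sketch using only core Perl: a child process created with <c>fork</c> hands its result back to the parent through a <c>pipe</c>. The helper name <c>run_in_child</c> is made up for this illustration. Note that this passes data <i>between</i> processes rather than truly sharing a variable, since each process works on its own copy of the parent's data after the fork:</p>

```perl
use strict;
use warnings;

$| = 1;   # flush STDOUT before forking so buffered output is not duplicated

# Run $work->() in a child process and return whatever the child
# wrote to the write end of a pipe.
sub run_in_child {
    my ($work) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {                  # child: compute, write, exit
        close $reader;
        print {$writer} $work->();
        close $writer;
        exit 0;
    }

    close $writer;                    # parent: read until the child closes
    my $result = do { local $/; <$reader> };
    close $reader;
    waitpid($pid, 0);
    return $result;
}

# Count the G and C letters of a test sequence in a child process.
my $gc = run_in_child(sub { my $seq = 'accgttac'; return ($seq =~ tr/gc//); });
print "GC count from child: $gc\n";
```

<p>The modules listed above wrap this kind of plumbing for you. The hand-rolled version is only meant to show what is going on underneath.</p>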
<p>Perl cloud computing... (something for the future!)
<br />[http://www.nntp.perl.org/group/perl.p5ee/2008/05/msg1336.html]
<br />[http://www.slideshare.net/acme/living-in-the-cloud/]
</p>
<p><b>To do:</b></p>
<ul>
<li>Comparing different options</li>
<li>Example code</li>
</ul>
<p>End of meditation...</p>
<p><b>Custom sorting</b></p>
<p>[http://perldoc.perl.org/functions/sort.html|perldoc -f sort] and [http://perldoc.perl.org/sort.html|the sort pragma]</p>
<p>Very confusing modules:</p>
<ul>
<li>[cpan://Sort::External::Cookbook]</li>
<li>[cpan://Sort::Maker]</li>
<li>[cpan://Sort::MultipleFields]</li>
<li>[cpan://Sort::Key]</li>
</ul>
<p>I find for most situations, rolling your own is the best approach:</p>
<c>
use warnings;
use strict;

my $aoh = [
    {
        first  => 1,
        second => ['foo', 'bar', 'baz',],
    },
    {
        first  => 2,
        second => ['foz', 'barz',],
    },
    {
        first  => 2,
        second => ['frip', 'barn', 'bazurt',],
    },
    {
        first  => 1,
        second => ['foo',],
    },
];

# use a named subroutine as the custom
# sort comparator
@$aoh = sort by_custom @$aoh;

######## subs ########
sub by_custom {
    return (
        # sort on first key - numeric ascending
        ( $a->{first} <=> $b->{first} )
        ||
        # then by the size of the second key - largest first
        ( scalar @{ $b->{second} } <=> scalar @{ $a->{second} } )
    );
}
</c>
<p><b>Some of my favourites</b></p>
<p>One of the nicer explanations out there : [id://778112]</p>
<p>Apart from being node 190000, I just think this is cool : [id://190000]</p>
<p><b>Others that I will look at again:</b></p>
<ul>
<li>[id://479213] </li>
<li>[id://482919] (and if anyone has comments on the updated code : [id://780250])</li>
<li>[id://481987]</li>
<li>[id://745674]</li>
<li>[id://29281]</li>
</ul>
<p><b>Uncategorised as yet:</b></p>
<br />[id://789927]
<br />[id://552151]
<br />[id://794216]
<br />[id://794303]
<br />[id://774421]
<br />[id://748175]
<br />[id://516706]
<br />[id://729070]
<br />[id://795164]
<br />[id://797136]
<br />[id://793984]
<br />[id://87628]
<br />[id://481745]
<br />[id://591547]
<br />[id://799081]
<br />[id://799614]
<br />[id://807559]
<br />[id://804383]
<br />[id://823756]
<br />[id://527357]