PerlMonks  

Hash help for biologist

by utterlyconfused (Novice)
on Jan 10, 2011 at 19:04 UTC

utterlyconfused has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am a beginner in Perl and I have an exam coming up. I need help with the following question from a past exam paper please.
Given an infile called seqs.fs and containing the following hypothetical FASTA formatted sequences:

>gi|209483500:3405-4275 [Homo sapiens] gene X, complete
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTT

>gi|209483501:3307-4262 [Pan paniscus] gene X, complete
AACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACA

>gi|209483502:3600-4187 [Mus musculus] gene X, complete
TAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACGGATGCT

>gi|209483502:3600-4187 [Canis familiaris] gene X, complete
ATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTC

>gi|209483502:3600-4187 [Rattus norvegicus] gene X, complete
CGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATCGATTTT

Write a program that will:
a) Read in seqs.fs and store the sequences in a HASH.
b) Loop through the HASH and write out an alphabetically sorted OUTFILE in which the names of the sequences are reduced to the species name only (the name between square brackets),
e.g. for the first one we want ">gi|209483500:3405-4275 [Homo sapiens] gene X, complete" reduced to ">Homo sapiens". (Could you possibly show me how to do it this way, and also how to print just the title line if I were searching for a motif in each sequence?)
c) Print to STDOUT the GC content of each sequence.
Try to use subroutines if possible.

The GC content is the total number of G's and C's in a sequence.

Any help would be greatly appreciated.

Replies are listed 'Best First'.
Re: Hash help for biologist
by toolic (Bishop) on Jan 10, 2011 at 19:23 UTC
Re: Hash help for biologist
by ELISHEVA (Prior) on Jan 10, 2011 at 19:26 UTC

    Any help would be greatly appreciated.

    We obviously can't answer your exam question for you. If you have specific questions about Perl, we're glad to help, but with such a broad request the most I can offer is a problem solving strategy.

    • Make a list of all the data you have at the start. What is the best way to organize that data? Does it belong in a hash? An array? Something more complicated? Not sure? Make a guess. perldata, perlreftut, and perldsc can give you some idea of your options for organizing your data.
    • Make a list of all the data you want to have at the end. How does it need to be organized into arrays, hashes, etc. in order to print out the results?
    • Start brainstorming about how to transform the data you have at the start into the data you need at the end. Try to spell out what you need to do in a list of steps.
    • Take a look at your starting data, do you need to change the way you organized it to make those steps easier to carry out?
    • Start coding. Keep reviewing your list of steps. Do they still make sense? How do they need to change? Take another look at the way you are organizing your data. Does that still make sense? Does it need to change to make it easier to work with? How?
    • When you code, try to think about how you can test your work so far. Does your code do what you think it does? Testing is very easy with Perl. Take a look at Test::Simple and Test::More to see how simple.
    • Does the code you've written so far compile? Are you using strict and warnings? I often find that cleaning up compilation errors as I go helps me quickly see problems in my original data structures and plan of action.
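
    As a rough illustration of the last two points, here is a minimal sketch of a test script that runs under strict and warnings and uses Test::More. The helper name gc_count is hypothetical (it does not appear anywhere in this thread), and the test strings are made up.

```perl
use strict;
use warnings;
use Test::More tests => 3;

# Hypothetical helper: count the G and C bases in a sequence string.
sub gc_count {
    my ($seq) = @_;
    # tr/// with an empty replacement list counts matching characters
    # without modifying $seq.
    return ( $seq =~ tr/GCgc// );
}

is( gc_count('GATTACA'), 2, 'one G and one C' );
is( gc_count('ATAT'),    0, 'no G or C' );
is( gc_count('ggCC'),    4, 'lower-case bases are counted too' );
```

    Writing a couple of checks like these before wiring the subroutine into the rest of the program makes it easier to tell whether a wrong result comes from the counting or from the file handling.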

    Those are just a few ideas to get you started. Best of luck with this exam question.

    You might also find this node helpful: Perl and Bioinformatics.

Re: Hash help for biologist
by umasuresh (Hermit) on Jan 10, 2011 at 19:34 UTC
Re: Hash help for biologist
by AR (Friar) on Jan 10, 2011 at 19:13 UTC
    Please show us how far you've gotten, and we can help you with any mistakes and misconceptions.
Re: Hash help for biologist
by planetscape (Chancellor) on Jan 11, 2011 at 01:54 UTC

    Perhaps something in Not Exactly a Hash Tutorial will give you a nudge in the right direction, at least as far as "looping through a hash" is concerned.
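
    For the "looping through a hash" part, a minimal sketch might look like the following. The hash name %seq_for and its contents are made up for illustration; the point is only that hash keys come back in no particular order, so they are sorted before use.

```perl
use strict;
use warnings;

# Made-up example data: species name => (shortened) sequence.
my %seq_for = (
    'Mus musculus' => 'TAAACA',
    'Homo sapiens' => 'ATACCC',
    'Pan paniscus' => 'AACGAA',
);

# Hashes have no inherent order, so sort the keys to get
# stable, alphabetical output.
my @names = sort keys %seq_for;

for my $species (@names) {
    print ">$species\n$seq_for{$species}\n";
}
```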

    HTH,

    planetscape
Re: Hash help for biologist
by tj_thompson (Monk) on Jan 10, 2011 at 19:19 UTC
    Agree with AR above, but also have a question. Is the format of your file as shown above? Or are just the sequences in the file? You mention that the file will contain the following information, but not whether what is shown is actually what is in the file.
Re: Hash help for biologist
by biohisham (Priest) on Jan 11, 2011 at 09:04 UTC
    Search through the website: there are many similar scenarios from which you can learn and develop your approach. We really value folks who try to do their part first and then, once they have run out of resources, come and ask for directions, opinions, or approaches; this way the learning process becomes more interactive and lasting. I could get you the following links to nodes posted within the Monastery to get you started; these are replete with discussions covering the basic areas of the functionality you seek. As soon as you are able to read a file, you can do many things with the lines that you read, including creating hashes. In your situation you have two ways to go: you can either use plain Perl directly or use BioPerl. The latter offers, at the price of time, good libraries to manipulate sequences in a systematic way...
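
    To make the plain-Perl route a little more concrete, here is a minimal sketch of reading FASTA-style lines and building a hash from them. The subroutine name read_fasta and the sample records are hypothetical, and the sketch assumes every header carries the species name in square brackets, as in the question. It reads from an in-memory filehandle so that it runs on its own; for the real file you would open seqs.fs instead.

```perl
use strict;
use warnings;

# Parse FASTA-style text from a filehandle into a hash:
# species name => sequence. Assumes headers look like
# ">... [Species name] ..." as in the question.
sub read_fasta {
    my ($fh) = @_;
    my ( %seq_for, $species );
    while ( my $line = <$fh> ) {
        chomp $line;
        next if $line =~ /^\s*$/;               # skip blank lines
        if ( $line =~ /^>.*?\[([^\]]+)\]/ ) {   # header: capture text in brackets
            $species = $1;
            $seq_for{$species} = '';
        }
        elsif ( defined $species ) {
            $seq_for{$species} .= $line;        # a sequence may span several lines
        }
    }
    return %seq_for;
}

# Made-up demo input held in a string.
my $text = ">gi|1 [Homo sapiens] gene X\nATAC\nCCAT\n\n>gi|2 [Pan paniscus] gene X\nAACG\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my %seq_for = read_fasta($fh);
close $fh;

print "$_ => $seq_for{$_}\n" for sort keys %seq_for;
```

    Keying the hash on the species name means two records with the same species would overwrite each other; that is fine for the sample data but worth keeping in mind.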

    Try your best and we are ready to pitch in as soon as you do your part on the side of learning...


    Excellence is an Endeavor of Persistence. A Year-Old Monk :D .
Re: Hash help for biologist
by jwkrahn (Abbot) on Jan 10, 2011 at 19:37 UTC
    $ echo ">gi|209483500:3405-4275 [Homo sapiens] gene X, complete
    ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTT
    >gi|209483501:3307-4262 [Pan paniscus] gene X, complete
    AACGAAAAATTCTAGGCTATATACAACTACGCAAAGGCCCCAACGTTGTAGGCCCCTACGGGCTACTACA
    >gi|209483502:3600-4187 [Mus musculus] gene X, complete
    TAAACACCCTCACCACTACAATCTTCCTAGGAACAACATATGACGCACTCTCCCCTGAACTCTACACAACGGATGCT
    >gi|209483502:3600-4187 [Canis familiaris] gene X, complete
    ATATTTTGTCACCAAGACCCTACTTCTAACCTCCCTGTTCTTATGAATTCGAACAGCATACCCCCGATTC
    >gi|209483502:3600-4187 [Rattus norvegicus] gene X, complete
    CGCTACGACCAACTCATACACCTCCTATGAAAAAACTTCCTACCACTCACCCTAGCATTACTTATATGATCGATTTT
    " | perl -ne'$hash{$1}=<>=~tr/CG//if/\[([^][]+)]/}{print map">$_\n$hash{$_}\n\n",sort keys%hash'
    >Canis familiaris
    30

    >Homo sapiens
    30

    >Mus musculus
    36

    >Pan paniscus
    33

    >Rattus norvegicus
    31
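
    For readers new to Perl, the one-liner above is quite dense, so here is one way it could be unrolled into a small script. This is a sketch rather than the poster's own code: the subroutine name gc_by_species and the second sample record are made up. Like the one-liner, it assumes each sequence sits on the single line directly after its header, and it reports a raw count of G's and C's rather than a percentage.

```perl
use strict;
use warnings;

# Read header/sequence pairs from a filehandle and return
# a hash of species name => number of G and C bases.
sub gc_by_species {
    my ($fh) = @_;
    my %gc_for;
    while ( my $header = <$fh> ) {
        # Only lines containing "[Species name]" are treated as headers.
        next unless $header =~ /\[([^\]\[]+)\]/;
        my $species = $1;
        my $seq = <$fh>;                 # sequence assumed to be the very next line
        $seq = '' unless defined $seq;   # guard against a header at end of input
        $gc_for{$species} = ( $seq =~ tr/CG// );   # tr counts without changing $seq
    }
    return %gc_for;
}

# First record is from the question; the second is a made-up example.
my $data = <<'END';
>gi|209483500:3405-4275 [Homo sapiens] gene X, complete
ATACCCATGGCCAACCTCCTACTCCTCATTGTACCCATTCTAATCGCAATGGCATTCCTAATGCTT

>gi|1 [Example species] made-up record
GATTACA
END

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my %gc_for = gc_by_species($fh);
close $fh;

print ">$_\n$gc_for{$_}\n\n" for sort keys %gc_for;
```

    In the one-liner, the -n switch supplies the while loop, and the "}{" sequence closes that loop early so that the print statement runs once after all input has been read.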
