 
PerlMonks  

Making a hash with groups of IDs

by jemswira (Novice)
on Feb 05, 2012 at 17:47 UTC (#951965)
jemswira has asked for the wisdom of the Perl Monks concerning the following question:

I have this project that's kinda urgent, and I still don't have much idea how to do it. What I'm supposed to do is take a file that looks like this:

    # STOCKHOLM 1.0
    #=GF ID 1-cysPrx_C
    #=GF AC PF10417.4
    #=GF DE C-terminal domain of 1-Cys peroxiredoxin
    #=GF AU Finn RD, Coggill PC
    #=GF SE Gene3D, pdb_1prx
    ...
    #=GS A3EU39_9BACT/160-195 AC A3EU39.1
    #=GS Q7VQB3_BLOFL/159-194 AC Q7VQB3.1
    #=GS Q057V5_BUCCC/160-195 AC Q057V5.1
    #=GS A5CDZ8_ORITB/160-195 AC A5CDZ8.1
    ...
    //
    Similar set of data with different numbers.

So the final file I need is one with this:

    A3EU39 | PF10417.4/ PF10000.3
    Q7VQB3 | PF10417.4/...

I also have a file with a list of these six-character IDs (digits and letters), and I have to arrange the output in that order. At first I wanted to go through the file each time, but both files are large, so I have to try to run through each of them just once. So I was thinking of a hash, like a hash with Q7VQB3->PF10417.4/.....

But to be honest, I have no idea how to. I'm sorry, but I'm really new. I was thinking:

    my %hash;
    open PFAMDB, "C:\\Users\\Jems\\Desktop\\Perl\\Pfam-A.seed" or die $!; #thats the main file.
    while (my $pfam=<PFAMDB>){
        my @units= split /#/,$pfam;
        if ($pfam=~ =GF AC){my $pf=$pfam;}
        if ($pfam=~ \sAC\s){if exists $hash{$pfam}{$hash{$pfam}=$pf} else .....

So this is where I get lost. Can I push a new value onto the end? Also, will this work in the first place? I'm sorry, but I'm still new D: Please help me? Thanks!

Re: Making a hash with groups of IDs
by moritz (Cardinal) on Feb 05, 2012 at 17:58 UTC
Re: Making a hash with groups of IDs
by RichardK (Priest) on Feb 05, 2012 at 18:48 UTC

    Reading perldsc (the Perl Data Structures Cookbook) should help answer part of your question, but I think your pattern matching is confused. Well, I can't tell what you're trying to do anyway ;)
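
As a minimal sketch of the hash-of-arrays structure that perldsc describes, and of the "push a new value to the end" part of the question: the helper name add_pair and the '/' separator are assumptions made for illustration, not anything from the thread.

```perl
use strict;
use warnings;

# Hash of arrays: each protein ID maps to a reference to an array
# of Pfam accessions.
my %groups;

# Hypothetical helper (not from the thread): record one ID/accession pair.
sub add_pair {
    my ($id, $accession) = @_;
    # push autovivifies $groups{$id} as an array reference the first time
    # an ID is seen, so no separate "exists" check is needed.
    push @{ $groups{$id} }, $accession;
    return;
}

add_pair('A3EU39', 'PF10417.4');
add_pair('Q7VQB3', 'PF10417.4');
add_pair('A3EU39', 'PF10000.3');

# Print one line per ID, with its accessions joined by '/'.
for my $id (sort keys %groups) {
    print "$id | ", join('/', @{ $groups{$id} }), "\n";
}
```

Because the array is created on the first push, the if/else on exists in the original attempt is not required for this approach.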

Re: Making a hash with groups of IDs
by CountZero (Bishop) on Feb 05, 2012 at 21:20 UTC
    Easy:
    use Modern::Perl;
    use Data::Dump qw/dump/;

    my %data;
    my $ac;
    while (<DATA>) {
        if (/^#=GF AC\s+(.*)$/) {
            $ac = $1;
            next;
        }
        if (/^#=GS ([^_]*)/) {
            push @{ $data{$1} }, $ac;
            next;
        }
    }
    say dump(%data);

    __DATA__
    # STOCKHOLM 1.0
    #=GF ID 1-cysPrx_C
    #=GF AC PF10417.4
    #=GF DE C-terminal domain of 1-Cys peroxiredoxin
    #=GF AU Finn RD, Coggill PC
    #=GF SE Gene3D, pdb_1prx
    ...
    #=GS A3EU39_9BACT/160-195 AC A3EU39.1
    #=GS Q7VQB3_BLOFL/159-194 AC Q7VQB3.1
    #=GS Q057V5_BUCCC/160-195 AC Q057V5.1
    #=GS A5CDZ8_ORITB/160-195 AC A5CDZ8.1
    ...
    //
    # LONDON 1.0
    #=GF ID 1-cysPrx_C
    #=GF AC PF10000.3
    #=GF DE C-terminal domain of 1-Cys peroxiredoxin
    #=GF AU Finn RD, Coggill PC
    #=GF SE Gene3D, pdb_1prx
    ...
    #=GS A3EU39_9BACT/160-195 AC A3EU39.1
    #=GS Q7VQB8_BLOFL/159-194 AC Q7VQB3.1
    #=GS Q057V5_BUCCC/160-195 AC Q057V5.1
    #=GS A5CDZ8_ORITB/160-195 AC A5CDZ8.1
    //
    # AMSTERDAM 1.0
    #=GF ID 1-cysPrx_C
    #=GF AC PF10999.3
    #=GF DE C-terminal domain of 1-Cys peroxiredoxin
    #=GF AU Finn RD, Coggill PC
    #=GF SE Gene3D, pdb_1prx
    ...
    #=GS A3EU39_9BACT/160-195 AC A3EU39.1
    #=GS Q7VQB8_BLOFL/159-194 AC Q7VQB3.1
    #=GS Q057V5_BUCCC/160-195 AC Q057V5.1
    #=GS A5CDZ8_ORITB/160-195 AC A5CDZ8.1
    Output:

    (
      "Q7VQB8",
      ["PF10000.3", "PF10999.3"],
      "Q7VQB3",
      ["PF10417.4"],
      "A3EU39",
      ["PF10417.4", "PF10000.3", "PF10999.3"],
      "Q057V5",
      ["PF10417.4", "PF10000.3", "PF10999.3"],
      "A5CDZ8",
      ["PF10417.4", "PF10000.3", "PF10999.3"],
    )

    The formatting of the output is left as an exercise for the reader.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James
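
One possible take on the formatting exercise from the reply above, combined with the question's requirement that the output follow the order of the ID list. This is only a sketch under stated assumptions: the subroutine names collect_groups and format_groups are hypothetical, the '/' separator is a guess at the wanted format, the in-memory arrays stand in for the two real files, and IDs missing from the Pfam data are simply skipped.

```perl
use strict;
use warnings;

# Build ID => [Pfam accessions] from Stockholm-style lines, using patterns
# close to the ones in the reply above.
sub collect_groups {
    my (@lines) = @_;
    my (%data, $ac);
    for my $line (@lines) {
        if ($line =~ /^#=GF AC\s+(\S+)/) {
            $ac = $1;                       # current family accession
        }
        elsif ($line =~ /^#=GS ([^_\s]*)/) {
            # Ignore sequence lines seen before any "#=GF AC" line.
            push @{ $data{$1} }, $ac if defined $ac;
        }
    }
    return \%data;
}

# Hypothetical helper: one "ID | PF.../PF..." string per ID, in the order
# the IDs are given. IDs with no Pfam entry are skipped (an assumption).
sub format_groups {
    my ($data, @ids) = @_;
    my @out;
    for my $id (@ids) {
        next unless exists $data->{$id};
        push @out, "$id | " . join('/', @{ $data->{$id} });
    }
    return @out;
}

# Small stand-ins for the Pfam file and the ID list file.
my @pfam_lines = (
    '#=GF AC   PF10417.4',
    '#=GS A3EU39_9BACT/160-195 AC A3EU39.1',
    '#=GS Q7VQB3_BLOFL/159-194 AC Q7VQB3.1',
    '//',
    '#=GF AC   PF10000.3',
    '#=GS A3EU39_9BACT/160-195 AC A3EU39.1',
);
my @wanted = ('Q7VQB3', 'A3EU39', 'P00000');

my @report = format_groups(collect_groups(@pfam_lines), @wanted);
print "$_\n" for @report;
```

With real data, the same two subroutines could be fed lines read from the files, so that each file is read only once and the ID list alone decides the output order.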

Re: Making a hash with groups of IDs
by GrandFather (Cardinal) on Feb 06, 2012 at 21:40 UTC

    This is the fourth in a series of questions relating to what appears to be the same project. Maybe it is time to stand back a little and describe your overall project rather than have us continually trying to guess what you are trying to achieve and squeezing information out of you a single small drop at a time?

    So far we have some information concerning the format of a couple of input files. We know that at least one of these is big. We know that you are selecting some data based on some other data. We know there is a third file involved.

    We don't know what you are trying to achieve in a "big picture" way. We don't know if this is a one off. If this is not a one off we don't know how the input data changes over time. We don't know if you need to perform multiple searches with the same data.

    You seem to focus on answering a few of the questions you've been asked and you seem to be looking for a quick fix solution to what is probably a small part of the problem. The more we know about the high level problem the more we can offer ways to address the big issues.

    True laziness is hard work

      So my entire project is to use LIBSVM to predict a protein's function. What I have been doing is taking the protein database's list of protein numbers (accession numbers, the 6-character things) and matching them to their PF numbers, that is, which PF groups they are in. The two files I have been using are up there. Now I have managed to get the data in the following format:

      B3T3Y0 | PF02517.11
      B3T4D5 | PF13371.1 PF13369.1
      B3T4G0 | PF13607.1
      B3T516 | PF08438.5
      B3T517 | PF13207.1 PF13238.1
      B3T644 | PF14382.1
      B3T662 | PF13248.1
      B3T663 | PF13248.1 PF13248.1 PF13240.1

      which is what I actually wanted all along. Thanks, CountZero. What I don't know is how to put it into LIBSVM, but my groupmate is looking into that. Thanks, everyone, though!
