
Searching file and printing

by jemswira (Novice)
on Dec 30, 2011 at 15:17 UTC (#945647=perlquestion)
jemswira has asked for the wisdom of the Perl Monks concerning the following question:

OK, so for this research project I have a file with data arranged like so:


#=GF ID 1-cysPrx_C
#=GF AC PF10417.4
#=GF DE C-terminal domain of 1-Cys peroxiredoxin

#=GS D8BPP0_ECOLX/154-186 AC D8BPP0.1
#=GS D6I5T0_ECOLX/154-186 AC D6I5T0.1

It's basically proteins and functional groups. The functional groups are the lines of the form #=GF AC PFxxxxx, and the proteins are the #=GS lines, such as the one for D8BPP0.

I want a list that says, for each protein, which groups it is in, e.g. D8BPP0 is in groups: PFxxxxx, etc.

I thought I would put the list of proteins (they're in a big file) into an array and loop over it, taking one protein at a time into a scalar. For each protein I'd read the second file, which holds the data shown above, with $/="\/\/"; so that each read returns one record, and then split each record on #. Then I'd use grep to pick out the functional group line and check whether the protein appears in that record. If it does, I'd push the functional group onto an array. At the end of the loop I'd print the array and go on to the next protein.

Example with a simplified list of proteins:

    $/="\/\/";
    our @acnumbers=qw(P0A252 Q9AT80 Q0HKB6);
    our $acnumbers;
    foreach $acnumbers(@acnumbers){
        my $unit;
        foreach $unit(<PFAMDB>){
            my @units= split /#/,$unit;
            my @pfx=grep(/=GF AC/,@units);
            our $units;
            foreach $units(@units){
                if ($units=~/.*AC $acnumbers/){
                    push (@list, @pfx);
                }else{next}
            }
        }
        print "$acnumbers is in:";
        print @list;
        undef @list;
    }

But all I get is:

P0A252 is in:=GF AC PF10417.4

Q9AT80 is in:Q0HKB6 is in:

How should I improve it?

Sorry for the messiness, but I've only just learned Perl.

Replies are listed 'Best First'.
Re: Searching file and printing
by MidLifeXis (Monsignor) on Dec 30, 2011 at 15:24 UTC

    It appears that you are intending to read through the file once for each protein in @acnumbers. You may be better off opening PFAMDB inside of your foreach $acnumbers... loop, or doing a seek to the beginning of the file prior to the inner foreach loop.

    The way it is currently written, PFAMDB is being read once to the end of file, where it remains for every other protein in @acnumbers.

    You may also want to explore BioPerl if you have not already.
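
The rewind suggestion above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual fix: it assumes the data fits the record layout shown in the question, uses a small in-memory sample (the second record and its accession PF00000.1 are made up for the example), and the names $pfamdb and %groups_for are hypothetical. It also pushes the group once per matching record rather than once per matching line.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the real file; records end with "//".
# The second record (Example_2 / PF00000.1) is invented for illustration.
my $sample = <<'END';
#=GF ID 1-cysPrx_C
#=GF AC PF10417.4
#=GS D8BPP0_ECOLX/154-186 AC D8BPP0.1
//
#=GF ID Example_2
#=GF AC PF00000.1
#=GS D6I5T0_ECOLX/154-186 AC D6I5T0.1
//
END

# An in-memory filehandle is used here; a real script would open the file on disk.
open my $pfamdb, '<', \$sample or die "Cannot open sample data: $!";

local $/ = "//";                       # read one "//"-terminated record at a time
my @acnumbers = qw(D8BPP0 D6I5T0 Q0HKB6);
my %groups_for;                        # protein => array ref of "=GF AC ..." lines

for my $ac (@acnumbers) {
    # Rewind before every pass; without this, only the first protein
    # ever sees any records because the handle stays at end of file.
    seek $pfamdb, 0, 0 or die "Cannot rewind: $!";

    my @list;
    while (my $record = <$pfamdb>) {
        my @units = split /#/, $record;
        my @pfx   = grep { /=GF AC/ } @units;
        s/\s+\z// for @pfx;            # trim trailing newlines
        push @list, @pfx if grep { /AC \Q$ac\E/ } @units;
    }

    $groups_for{$ac} = \@list;
    print "$ac is in: @list\n";
}
```

For a large file and a long protein list, re-reading the file once per protein may be slow; reading the file once and building a hash keyed by protein would be an alternative design.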


      OHHH thanks! It works now.

      Also, how would you remove the =GF AC before printing? Would you use s/=GF AC/ / or something?
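
One way the substitution asked about above might look is sketched here. It assumes each collected entry looks like "=GF AC PF10417.4" followed by a newline, as in the output shown in the question; the variable names @list and @accessions are illustrative only. Replacing the prefix with an empty string (rather than a space) and anchoring the pattern at the start avoids leaving a stray leading blank.

```perl
use strict;
use warnings;

# Example entry in the form printed in the question (assumed layout).
my @list = ("=GF AC PF10417.4\n");

my @accessions;
for my $entry (@list) {
    # Work on a copy so the original entry is left untouched.
    (my $clean = $entry) =~ s/^\s*=GF\s+AC\s+//;   # drop the "=GF AC " prefix
    $clean =~ s/\s+\z//;                            # drop the trailing newline
    push @accessions, $clean;
}

print "is in: @accessions\n";
```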
