Using less memory with BIG files

by jemswira (Novice)
on Feb 02, 2012 at 07:42 UTC ( #951371=perlquestion )
jemswira has asked for the wisdom of the Perl Monks concerning the following question:

Hi guys. I have been trying to build a database that combines two files of data. One is AccessionNumbersfull.txt:

    A0AQI4
    A0AQI5
    A0AQI7
    .....

The other is Pfam-A.seed:

    # STOCKHOLM 1.0
    #=GF ID 1-cysPrx_C
    #=GF AC PF10417.4
    #=GF DE C-terminal domain of 1-Cys peroxiredoxin
    #=GF AU Finn RD, Coggill PC
    #=GF SE Gene3D, pdb_1prx
    ...
    #=GS A3EU39_9BACT/160-195 AC A3EU39.1
    #=GS Q7VQB3_BLOFL/159-194 AC Q7VQB3.1
    #=GS Q057V5_BUCCC/160-195 AC Q057V5.1
    #=GS A5CDZ8_ORITB/160-195 AC A5CDZ8.1
    ...

What I'm supposed to do is match the numbers in the first file to the groups in the second. The group name is what follows #=GF AC (PFxxxxx). The problem is that the files are huge: the first file alone is 138 MB, so I run into memory issues. My code is as follows.

    #!/usr/bin/perl
    use warnings;
    use strict;

    open OUTPUT, ">C:\\Users\\Jems\\Desktop\\Perl\\PFAMin.txt" or die $!;
    open ANUMBER, "C:\\Users\\Jems\\Desktop\\Perl\\AccessionNumbersfull.txt" or die $!;
    our @acnumbers;
    select OUTPUT;
    $|=1;
    foreach (<ANUMBER>){
        chomp;
        push (@acnumbers, $_);
    }
    $/="\/\/";
    our $acnumbers;
    our @list;
    foreach $acnumbers(@acnumbers){
        open PFAMDB, "C:\\Users\\Jems\\Desktop\\Perl\\Pfam-A.seed" or die $!;
        my $unit;
        foreach $unit(<PFAMDB>){
            my @units= split /#/,$unit;
            my @pfx=grep(/=GF AC/,@units);
            foreach (@pfx){s/=GF AC/\x20/};
            our $units;
            foreach $units(@units){
                if ($units=~/.*AC $acnumbers/){
                    push (@list, @pfx);
                }else{next}
            }
        }
        print "$acnumbers is in:";
        print "@list \n";
        undef @list;
    }

Is there any way to streamline it?

Another thing I need to do is add the names corresponding to the numbers. Those are in a separate file, but in the same order; the numbers were taken from that file. Its format is:

    >tr|A0FGZ9|A0FGZ9_9ARCH Methyl coenzyme M reductase (Fragment) OS=uncultured archaeon GN=mcrA PE=4 SV=1
    >tr|A0FH03|A0FH03_9ARCH Methyl coenzyme M reductase (Fragment) OS=uncultured archaeon GN=mcrA PE=4 SV=1

But I don't know how to do that. Any ideas? Thanks!!

Sorry, but it's kinda urgent and I've been trying for ages!

Replies are listed 'Best First'.
Re: Using less memory with BIG files
by moritz (Cardinal) on Feb 02, 2012 at 08:11 UTC

    There are a number of things you can improve. The two most important are:

    1. If you do foreach $unit(<PFAMDB>) { } it reads the whole file into memory first, then iterates over it. If you instead write while ($unit = <PFAMDB>) { ... }, the file is read line by line (more precisely, one record at a time, as delimited by $/).

    2. Instead of doing a nested loop, read the IDs into a hash first, then extract the IDs from the second file and look them up in the hash. That will greatly speed things up.
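A minimal sketch of the two points above, not the only way to do it. The helper name find_matches and the in-memory sample data are made up for illustration, and the code assumes that the accession on a #=GS line carries a version suffix (A3EU39.1) that the ID list does not.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [accession, family] pairs.
sub find_matches {
    my ($acc_fh, $pfam_fh) = @_;

    # Point 2: load the wanted IDs into a hash once.
    my %wanted;
    while (my $line = <$acc_fh>) {       # point 1: while, not foreach
        $wanted{$_} = 1 for split ' ', $line;
    }

    my @matches;
    my $family;                          # family of the current record
    while (my $line = <$pfam_fh>) {
        if ($line =~ /^#=GF\s+AC\s+(\S+)/) {
            $family = $1;                # e.g. PF10417.4
        }
        elsif ($line =~ /^#=GS\s+\S+\s+AC\s+([^.\s]+)/) {
            # $1 is the accession without its version suffix.
            push @matches, [ $1, $family ] if defined $family && $wanted{$1};
        }
        elsif ($line =~ m{^//}) {
            undef $family;               # "//" ends a Stockholm record
        }
    }
    return @matches;
}

# Demo on in-memory sample data; real code would open the two files.
my $acc_data  = "A3EU39\nQ057V5\nZZZZZ9\n";
my $pfam_data = <<'END';
# STOCKHOLM 1.0
#=GF ID 1-cysPrx_C
#=GF AC PF10417.4
#=GS A3EU39_9BACT/160-195 AC A3EU39.1
#=GS Q7VQB3_BLOFL/159-194 AC Q7VQB3.1
#=GS Q057V5_BUCCC/160-195 AC Q057V5.1
//
END
open my $acc_fh,  '<', \$acc_data  or die "Cannot open sample data: $!";
open my $pfam_fh, '<', \$pfam_data or die "Cannot open sample data: $!";
print "$_->[0] is in: $_->[1]\n" for find_matches($acc_fh, $pfam_fh);
```

The Pfam file is scanned exactly once, however many IDs there are, because each lookup is a single hash access instead of another pass over the file.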

      For the second part, I store each line as a hash key, but I don't really know what to do after that.

      Would it make more sense to give each group a hash entry, with the group name as the key and the number as the value? The problem is that IDs would be repeated, so I don't know what I can do.

      Sorry, but I'm still kinda new to this D:
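One possible answer to the repeated-ID worry is a hash of arrays: the ID is the key and the value is a list of every group it appears in. The sketch below indexes the Pfam file that way and then streams the ID list past the index. The helper names index_pfam and report and the sample records are made up, and it is only an assumption that the Pfam index is the smaller of the two structures to hold in memory.

```perl
use strict;
use warnings;

# Hypothetical helper: index the Pfam file as accession => [families].
sub index_pfam {
    my ($pfam_fh) = @_;
    my (%families_for, $family);
    while (my $line = <$pfam_fh>) {
        if ($line =~ /^#=GF\s+AC\s+(\S+)/) {
            $family = $1;
        }
        elsif ($line =~ /^#=GS\s+\S+\s+AC\s+([^.\s]+)/ && defined $family) {
            my $acc  = $1;
            my $list = $families_for{$acc} ||= [];
            # An ID may occur in several families: keep each one once.
            push @$list, $family unless grep { $_ eq $family } @$list;
        }
        elsif ($line =~ m{^//}) {
            undef $family;               # end of a Stockholm record
        }
    }
    return \%families_for;
}

# Hypothetical helper: stream the ID list, one output line per ID.
sub report {
    my ($acc_fh, $families_for, $out_fh) = @_;
    while (my $acc = <$acc_fh>) {
        chomp $acc;
        next unless length $acc;
        my @families = @{ $families_for->{$acc} || [] };
        print {$out_fh} "$acc is in: @families\n";
    }
}

# Made-up sample records, only to show the shape of the data.
my $pfam_data = <<'END';
#=GF AC PF10417.4
#=GS Q8K9W0_SAMPLE/1-40 AC Q8K9W0.1
//
#=GF AC PF10425.1
#=GS Q8K9W0_SAMPLE/50-90 AC Q8K9W0.1
//
END
my $acc_data = "Q8K9W0\nA0AQI4\n";
open my $pfam_fh, '<', \$pfam_data or die "Cannot open sample data: $!";
open my $acc_fh,  '<', \$acc_data  or die "Cannot open sample data: $!";
report($acc_fh, index_pfam($pfam_fh), \*STDOUT);
```

With this layout the ID list never has to be held in memory at all; only the index built from the Pfam file does.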

Re: Using less memory with BIG files
by sundialsvc4 (Abbot) on Feb 02, 2012 at 14:09 UTC

    Extending moritz’s idea a little further, another trick is to scan the file stem to stern once, noting where the “important pieces” begin and end, and what the “key values” are that you will use when searching for those records. Insert the keys into a hash, with a file position (or a list of file positions) as the value. Then, after this one sequential pass through the entire file, you can seek() directly to those positions at any time thereafter. (If along the way you have noted both the starting position and the size of the entry, you can “slurp” any particular record into, say, a string variable fairly effortlessly.) This is a useful technique to apply to files that are “loosely” structured, as this one seems to be.
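A rough sketch of that offset index using tell, seek and read. The helper names build_offset_index and fetch_record are made up, and the code assumes every record ends with "//" on a line of its own and is keyed by its #=GF AC value. On a real file on Windows, binmode on the handle would probably be wise so that byte offsets and lengths agree.

```perl
use strict;
use warnings;

# Hypothetical helper: one sequential pass recording, for every
# "#=GF AC" key, the start offset and length of its record.
sub build_offset_index {
    my ($fh) = @_;
    my %position_of;
    local $/ = "//\n";                   # one Stockholm record per read
    while (1) {
        my $start  = tell $fh;
        my $record = <$fh>;
        last unless defined $record;
        if ($record =~ /^#=GF\s+AC\s+(\S+)/m) {
            $position_of{$1} = [ $start, length $record ];
        }
    }
    return \%position_of;
}

# Hypothetical helper: fetch one record later without rescanning.
sub fetch_record {
    my ($fh, $position_of, $key) = @_;
    my $entry = $position_of->{$key} or return;
    my ($start, $length) = @$entry;
    seek($fh, $start, 0) or die "seek failed: $!";
    read($fh, my $record, $length);
    return $record;
}

# Made-up two-record sample; a real script would open Pfam-A.seed.
my $data = <<'END';
# STOCKHOLM 1.0
#=GF AC PF10417.4
#=GS A3EU39_9BACT/160-195 AC A3EU39.1
//
# STOCKHOLM 1.0
#=GF AC PF10425.1
#=GS Q8K9W0_SAMPLE/1-40 AC Q8K9W0.1
//
END
open my $fh, '<', \$data or die "Cannot open sample data: $!";
my $index = build_offset_index($fh);
print fetch_record($fh, $index, 'PF10425.1');
```

Only the keys and two integers per record stay in memory; the record text itself is read on demand.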

    Now, if you happen to know that the two files are sorted, and specifically that they are sorted the same way ... if you can positively assert based on some outside knowledge that this is true, and that this always will be true, with regard to these files ... then your logic becomes a good bit simpler, because you can read the two files sequentially and do everything in just one forward pass, just as they used to do when the only mass-storage device of any reasonable size at your disposal was a tape drive. It would be too messy to sort them yourself, and maybe you do not want to risk that they might be, ahem, “out of sorts,” but it’s a handy trick to use (and bloody fast ...) when you know that you can.
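For illustration, the single forward pass might look like the sketch below. It assumes both inputs hold one key per line, already sorted in the same (string) order, which nothing in this thread confirms for the actual files. The helper names next_key and sorted_intersection are made up.

```perl
use strict;
use warnings;

# Hypothetical helper: read one key, without its newline (undef at EOF).
sub next_key {
    my ($fh) = @_;
    my $line = <$fh>;
    chomp $line if defined $line;
    return $line;
}

# Hypothetical helper: keys present in both sorted inputs, found in
# one pass while holding only one line of each file at a time.
sub sorted_intersection {
    my ($fh_a, $fh_b) = @_;
    my @common;
    my $key_a = next_key($fh_a);
    my $key_b = next_key($fh_b);
    while (defined $key_a && defined $key_b) {
        my $order = $key_a cmp $key_b;
        if ($order < 0) {
            $key_a = next_key($fh_a);    # A is behind: advance A
        }
        elsif ($order > 0) {
            $key_b = next_key($fh_b);    # B is behind: advance B
        }
        else {
            push @common, $key_a;        # same key in both files
            $key_a = next_key($fh_a);
            $key_b = next_key($fh_b);
        }
    }
    return @common;
}

# Demo on in-memory sample data.
my $list_a = "A0AQI4\nA0AQI5\nA0AQI7\n";
my $list_b = "A0AQI5\nA0AQI7\nA0FGZ9\n";
open my $fh_a, '<', \$list_a or die "Cannot open sample data: $!";
open my $fh_b, '<', \$list_b or die "Cannot open sample data: $!";
print "$_\n" for sorted_intersection($fh_a, $fh_b);
```

If the inputs are not sorted the same way, this silently misses matches, which is exactly the risk described above.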

      So from what I see, I should be taking the IDs from the Pfam-A.seed file and putting them in a hash. But there are two parts to the important info in the Pfam-A.seed file: the first is the ID, the second is the group name. There are around 1000 groups in that file, and several million IDs in the first file. Wouldn't memory be a problem?

      OK, to be 100% honest, I don't fully understand everything going on now. Would you mind guiding me a bit here? Sorry, but I only started learning Perl recently.

Re: Using less memory with BIG files
by GrandFather (Sage) on Feb 02, 2012 at 20:24 UTC

    I don't see any "make a database" in there. Do you mean you want to use two existing files as a database, using one as a key column for example? Or do you mean that you want to take the data from two existing files and generate a database from them? Or maybe you mean something else?

    In any case, we can probably help you more if you show us just a little more of the code, especially the output part. Even just making clear what you want to achieve end to end would help a lot.

    As an aside, don't use our; it's not doing what you think or what you want. Use my instead. $acnumbers and @list should be declared (using my) inside the for loop; the undef is then not needed.

    It may be that your code does show the output you really want (I missed the possibility due to using select instead of print $outFile ...). BTW, did I mention you should always use three parameter open and lexical file handles? You should!
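As a small illustration of those points (three-argument open, lexical file handles, my instead of our, and print to an explicit handle instead of select), here is a sketch. The helper name copy_ids and the temporary file names are made up for the demo.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy non-empty lines from one file to another
# and return how many were copied.
sub copy_ids {
    my ($in_path, $out_path) = @_;

    # Three-argument open: the mode is separate from the file name,
    # and the handles are ordinary lexical variables.
    open my $in,  '<', $in_path  or die "Cannot read '$in_path': $!";
    open my $out, '>', $out_path or die "Cannot write '$out_path': $!";

    my $count = 0;
    while (my $id = <$in>) {             # 'my', scoped to the loop
        chomp $id;
        next unless length $id;
        print {$out} "$id\n";            # explicit handle, no select
        $count++;
    }
    close $out or die "Cannot close '$out_path': $!";
    return $count;
}

# Demo in a temporary directory that is removed on exit.
my $dir     = tempdir(CLEANUP => 1);
my $in_path = "$dir/ids.txt";
open my $seed, '>', $in_path or die "Cannot write '$in_path': $!";
print {$seed} "A0AQI4\nA0AQI5\n\nA0AQI7\n";
close $seed or die "Cannot close '$in_path': $!";

my $copied = copy_ids($in_path, "$dir/out.txt");
print "Copied $copied IDs\n";
```

Because the handles are lexical, they close automatically when they go out of scope, and STDOUT stays available for progress messages.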

    In any case if you want a database use one - I'd suggest SQLite in this sort of context.

    True laziness is hard work

      Well, actually, what I want is output in this format:

      Q8K9W0 | name |PF10417.4/ PF10425.1

      The name is in a third file that also has the accession number (the Q8K9W0 part). It's the one with this format:

      >tr|A0FGZ9|A0FGZ9_9ARCH Methyl coenzyme M reductase (Fragment) OS=uncultured archaeon GN=mcrA PE=4 SV=1
      >tr|A0FH03|A0FH03_9ARCH Methyl coenzyme M reductase (Fragment) OS=uncultured archaeon GN=mcrA PE=4 SV=1

      So that's what I need. Also, what are a three-parameter open and lexical file handles? All I know I learnt from the first 6 chapters of Beginning Perl by Simon Cozens.

        The answers to both your questions happen to be in the open link I gave you. Aside from asking for help here, it is worth finding out how to use the Perl documentation that almost certainly was installed with your Perl. Try typing perldoc from your command line.

        True laziness is hard work
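For the requested output format (accession | name | families), a sketch of the two remaining pieces could look like this. It assumes that "name" means the description between the identifier and OS= in the header line; the helper names parse_header and format_row are made up, and the spacing around the separators is a guess at what is wanted.

```perl
use strict;
use warnings;

# Hypothetical helper: extract (accession, name) from a header such as
#   >tr|A0FGZ9|A0FGZ9_9ARCH Methyl coenzyme M reductase (Fragment) OS=...
# Returns an empty list when the line does not look like a header.
sub parse_header {
    my ($line) = @_;
    return unless $line =~ /^>\w+\|([^|]+)\|\S+\s+(.*?)\s+OS=/;
    return ($1, $2);
}

# Hypothetical helper: build one output row.
sub format_row {
    my ($acc, $name, @families) = @_;
    return join ' | ', $acc, $name, join('/ ', @families);
}

my $header = '>tr|A0FGZ9|A0FGZ9_9ARCH Methyl coenzyme M reductase (Fragment) '
           . 'OS=uncultured archaeon GN=mcrA PE=4 SV=1';
my ($acc, $name) = parse_header($header);

# The family list here is illustrative; in a full script it would come
# from a lookup built from Pfam-A.seed.
print format_row($acc, $name, 'PF10417.4', 'PF10425.1'), "\n";
```

Reading the header file with a while loop and calling parse_header on each line starting with ">" keeps memory use flat, since only one header is held at a time.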
