Re: Filter and writing error log file

by Laurent_R (Canon)
on Jul 22, 2014 at 21:44 UTC


in reply to Filter and writing error log file

Could it be done while reading the file in a while loop (as I have tried below in the code)?

Yes, by all means: not only can it be done this way, but this is most often the best way to do it (i.e. when the current data under examination doesn't need to know about the previous or next data chunk to be validated, examined or used). This is especially true with DNA files which, as far as I know (I am not a bio guy), can be very large. Using a while loop on your file (i.e. a file iterator) makes it possible to read absolutely huge files without ever running into out-of-memory problems (it might take time, but at least you have a very high probability of running your program to the end).

I am working almost daily with huge files (typically between 3 and 15 GB, sometimes as much as 200 GB). In such cases, slurping the file into memory is just not an option; my program would die. Reading it with an iterator (a 'while (my $line = <$IN>) {' type of construct) is the only solution, and it does not use more memory than the size needed for the longest line.
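For readers less familiar with the construct, here is a minimal sketch of such a line-by-line reader (the file name is made up for the example):

use strict;
use warnings;

# Only one line is held in memory at a time, however large the file is.
open my $IN, '<', 'dna_fragments.txt' or die "Cannot open dna_fragments.txt: $!";
while (my $line = <$IN>) {
    chomp $line;
    # validate or process $line here; previous lines are no longer in memory
}
close $IN or die "Cannot close dna_fragments.txt: $!";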

I had a relatively similar problem over the last few days and found the solution this morning: a proprietary database (with no Perl module/driver), but implementing a protocol similar to SQL. I needed to load a rather large quantity of data into memory and then process the main table using that data. My first attempt last week ran out of memory and the program crashed. Before trying to load the main table, my original program was already using 155,000 blocks of memory (my best guess is that a memory block is 8 kB, but I am not sure). Anyway, after having loaded those 155,000 blocks, trying to load the main table failed for lack of memory. After some experimentation, I was able to reduce memory consumption (changing hashes of hashes to hashes of strings), but the main improvement was to use an iterator on the main table, with a syntax as follows:

open my $invoice_lines, "-|", $command or die "could not fork $!";
while (my $inv_line = <$invoice_lines>) {
    # ...
}
Having made these changes, my program never uses more than 62,000 blocks, so it can be considered fairly safe.
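For illustration, a minimal sketch of the hashes-of-hashes to hashes-of-strings change mentioned above (the field names and values are invented for the example, not taken from the actual program):

use strict;
use warnings;

# Hash of hashes: one anonymous hash (with its own overhead) per key.
my %customers_hoh;
$customers_hoh{42} = { name => 'ACME', city => 'Paris', balance => 1250 };

# Hash of strings: the same fields packed into a single delimited string,
# which usually needs noticeably less memory per entry.
my %customers_hos;
$customers_hos{42} = join '|', 'ACME', 'Paris', 1250;

# Unpack on demand:
my ($name, $city, $balance) = split /\|/, $customers_hos{42};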

Other comments on your code: your identifiers are very poor; X, Y and A don't say anything about the content. Similarly, while $a, $t, $g and $c may seem OK when talking about DNA, I would suggest you use at least two letters for better identification. Also, the $a variable (and also $b) has a special meaning in Perl (it is used especially for sorting) and should probably be avoided for other purposes.

I am not sure I answered your question, but then I am not sure what your question really was. I hope that I gave at least some indications.

Replies are listed 'Best First'.
Re^2: Filter and writing error log file
by newtoperlprog (Sexton) on Jul 23, 2014 at 13:08 UTC

    Dear All,

    Thank you very much for your time and suggestions.

    I agree with Laurent_R that these DNA files can be very long and that loading them into an array at the beginning could pose memory problems. This is the reason why I am trying to read the file in a while loop and checking for conditions. I tried to use an if condition in the program:

    if (($seq =~ /[A|T|G|C]/) && ($lenseq == 19)) {
        print "$seq\n";
    }
    else {
        # here I want to print those fragments whose length is either less than
        # or greater than 19, and those fragments that contain bases other than [ATGC]
        print "error log file";
    }

    All this in a while loop so that I can read huge files without worrying about the memory issues.

    Would it be possible to get some directions on how to check those conditions, so that the sequences are processed further only if the conditions are true?

    Thanks to all of you

      To check that a string contains something other than A, C, T, or G, search for the offending character, so in your condition, use
      $seq !~ /[^ACTG]/

      Note that | is not needed in a character class (in fact, it matches literally, so avoid it if you don't want to match it).
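      A small illustration of why the negated class is needed (the example string is invented):

      use strict;
      use warnings;

      my $seq = 'ANNG';                                    # contains the invalid letter N
      print "matches [ATGC]\n"      if $seq =~ /[ATGC]/;   # true: the single A is enough to match
      print "has an invalid base\n" if $seq =~ /[^ATGC]/;  # true: finds the offending N
      print "all bases valid\n"     if $seq !~ /[^ATGC]/;  # only true for pure ATGC strings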

      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Thanks for the suggestions. One question: why do we have to use '^' to match, rather than

        [ATGC]

        I have one loop-related question. I have defined an array of the alphabet from ("A" .. "Z"), but after reading a long file the letters run out and the program shows errors about uninitialized values.

        My question is: how can I define an array of letters which goes on to AA, BB, CC and so on when "A" .. "Z" ends?
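        For that sub-question, a minimal sketch of one possible approach, using Perl's built-in string auto-increment (which continues A .. Z, AA, AB, AC, ... rather than the exact AA, BB, CC sequence mentioned, but never runs out):

        use strict;
        use warnings;

        # After 'Z', the magical string increment rolls over to 'AA', then 'AB', ...
        my $label = 'A';
        for (1 .. 30) {
            print "$label ";
            $label++;          # A, B, ..., Z, AA, AB, AC, AD
        }
        print "\n";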

      Hi, you could have a number of next statements to discard records that are not good. For example:
      while (<$IN_FILE>) {
          chomp;
          next if /[^ACTG]/;     # removes lines with other letters
          next if length != 19;  # removes lines not 19 characters long
          # I just made up the next rule for the example
          next if /(.)\1\1/;     # removes lines where the same letter comes three times in a row
          # etc.
          # now start doing the real processing
          # ...
      }
      The next statement goes directly to the next iteration of the while loop, so that faulty lines are effectively discarded early in the process.
        Thanks Laurent_R for the suggestion :-)
