Re: Filter and writing error log file

by Laurent_R (Canon)
on Jul 22, 2014 at 21:44 UTC


in reply to Filter and writing error log file

Could it be done while reading the file in a while loop (as I have tried below in the code)?

Yes, by all means: not only can it be done this way, but this is most often the best way to do it (i.e. when the current data under examination doesn't need to know about the previous or next data chunk to be validated, examined or used). This is especially true with DNA files which, as far as I know (I am not a bio guy), can be very large. Using a while loop on your file (i.e. a file iterator) makes it possible to read absolutely huge files without ever running into out-of-memory problems (it might take time, but at least you have a very high probability of running your program to the end).

I am working almost daily with huge files (typically between 3 and 15 GB, sometimes as much as 200 GB). In such cases, slurping the file into memory is just not an option; my program would die. Reading it with an iterator (a 'while (my $line = <$IN>) {' type of construct) is the only solution, and it does not use more memory than the size needed for the longest line.
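For readers less familiar with the construct, here is a minimal sketch of such a line-by-line reader (the file name is made up for the example):

use strict;
use warnings;

# Only one line is held in memory at a time, however large the file is.
open my $IN, '<', 'dna_fragments.txt' or die "Cannot open dna_fragments.txt: $!";
while (my $line = <$IN>) {
    chomp $line;
    # validate or process $line here; previous lines are no longer in memory
}
close $IN or die "Cannot close dna_fragments.txt: $!";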

I had a relatively similar problem over the last few days and found the solution this morning: a proprietary database (with no Perl module/driver), but implementing a protocol similar to SQL. I needed to load a rather large quantity of data into memory and then process the main table using that data. My first attempt last week ran out of memory and the program crashed. Before trying to load the main table, my original program was already using 155,000 blocks of memory (my best guess is that a memory block is 8 kB, but I am not sure). Anyway, after having loaded those 155,000 blocks, trying to load the main table failed for lack of memory. After some experimentation, I was able to reduce memory consumption (changing hashes of hashes to hashes of strings), but the main improvement was to use an iterator on the main table, with a syntax as follows:

open my $invoice_lines, "-|", $command or die "could not fork $!";
while (my $inv_line = <$invoice_lines>) {
    # ...
}
Having made these changes, my program never uses more than 62,000 blocks, so it can be considered fairly safe.
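For illustration, a minimal sketch of the hashes-of-hashes to hashes-of-strings change mentioned above (the field names and values are invented for the example, not taken from the actual program):

use strict;
use warnings;

# Hash of hashes: one anonymous hash (with its own overhead) per key.
my %customers_hoh;
$customers_hoh{42} = { name => 'ACME', city => 'Paris', balance => 1250 };

# Hash of strings: the same fields packed into a single delimited string,
# which usually needs noticeably less memory per entry.
my %customers_hos;
$customers_hos{42} = join '|', 'ACME', 'Paris', 1250;

# Unpack on demand:
my ($name, $city, $balance) = split /\|/, $customers_hos{42};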

Other comments on your code: your identifiers are very poor; X, Y and A don't say anything about the content. Similarly, while $a, $t, $g and $c may seem OK when talking about DNA, I would suggest you use at least two letters for better identification. Also, the $a variable (and also $b) has a special meaning in Perl (it is used especially for sorting) and should probably be avoided for other purposes.

I am not sure I answered your question, but then I am not sure what your question really was. I hope that I gave at least some indications.

Replies are listed 'Best First'.
Re^2: Filter and writing error log file
by newtoperlprog (Sexton) on Jul 23, 2014 at 13:08 UTC

    Dear All,

    Thank you very much for your time and suggestions.

    I agree with Laurent_R that these DNA files can be very long and that loading them into an array at the beginning could pose memory problems. This is the reason why I am trying to read the file in a while loop and checking for conditions. I tried to use an if condition in the program:

    if (($seq =~ /[A|T|G|C]/) && ($lenseq == 19)) {
        print "$seq\n";
    }
    else {
        # here I want to print those fragments whose length is either less than
        # or greater than 19, and those fragments that contain bases other than [ATGC]
        print "error log file";
    }

    All this in a while loop so that I can read huge files without worrying about the memory issues.

    Would it be possible to get some directions on how to check those conditions, so that the sequences are processed further only if the conditions are true?

    Thanks to all of you

      To check that a string contains something other than A, C, T, or G, search for the offending character, so in your condition, use
      $seq !~ /[^ACTG]/

      Note that | is not needed in a character class (in fact, it matches literally, so avoid it if you don't want to match it).
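      A small illustration of why the negated class is needed (the example string is invented):

      use strict;
      use warnings;

      my $seq = 'ANNG';                                    # contains the invalid letter N
      print "matches [ATGC]\n"      if $seq =~ /[ATGC]/;   # true: the single A is enough to match
      print "has an invalid base\n" if $seq =~ /[^ATGC]/;  # true: finds the offending N
      print "all bases valid\n"     if $seq !~ /[^ATGC]/;  # only true for pure ATGC strings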

      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ

        Thanks for the suggestions. One question: why do we have to use '^' to match, rather than

        [ATGC]

        I have one loop-related question. I have defined an array of the alphabet from ("A" .. "Z"), but after reading a long file the letters run out and the program shows errors about uninitialized values.

        My question is: how can I define an array of letters which goes on to AA, BB, CC and so on when "A" .. "Z" ends?
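        For that sub-question, a minimal sketch of one possible approach, using Perl's built-in string auto-increment (which continues A .. Z, AA, AB, AC, ... rather than the exact AA, BB, CC sequence mentioned, but never runs out):

        use strict;
        use warnings;

        # After 'Z', the magical string increment rolls over to 'AA', then 'AB', ...
        my $label = 'A';
        for (1 .. 30) {
            print "$label ";
            $label++;          # A, B, ..., Z, AA, AB, AC, AD
        }
        print "\n";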

      Hi, you could have a number of next statements to discard records that are not good. For example:
      while (<$IN_FILE>) {
          chomp;
          next if /[^ACTG]/;     # removes lines with other letters
          next if length != 19;  # removes lines not 19 characters long
          # I just made up the next rule for the example
          next if /(.)\1\1/;     # removes lines where the same letter comes three times in a row
          # etc.
          # now start doing the real processing
          # ...
      }
      The next statement goes directly to the next iteration of the while loop, so that faulty lines are effectively discarded early in the process.
        Thanks Laurent_R for the suggestion :-)
