PerlMonks  

Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007

by supriyoch_2008 (Scribe)
on Jan 30, 2012 at 05:56 UTC (#950702=perlquestion)
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I am a beginner in Perl programming. I have written a Perl program that counts the number of bases in a DNA molecule from a text file, run from the command prompt in Windows XP. The program works well with small text files of DNA sequence data and gives correct results.

But when I tried to count the bases in a 299 MB text file, the program printed “Out of memory!” in cmd after nearly 2 minutes. One Perl monk suggested that I use Tie::File to solve this problem. I looked up the syntax of Tie::File on the internet but could not work out how to use it in my program when the filename is read with the <STDIN> input operator into a scalar variable, as in $DNAfilename=<STDIN>;

I have given the Perl program below. May I expect help from any Perl monk to sort out this problem, either using Tie::File or some other method, so that I can analyze a 299 MB text file? My laptop has 2 GB of RAM and runs Active Perl 5.10.1 Build 1007.

Can I use code like %mem=300 MW; in Perl? This type of code is used by some theoretical and physical chemists in specific C-based programs for “Quantitative Structure Analysis and Report (QSAR)” of biomolecules to solve memory problems.

#!usr/bin/perl
print "\n\nPlease type the filename of the DNA sequence data: ";
$DNAfilename=<STDIN>;
chomp $DNAfilename;
unless ( open(DNAFILE, $DNAfilename) ) {
    print "Cannot open file \"$DNAfilename\"\n\n";
    exit;
}
while(@DNA= <DNAFILE>) {
    $DNA=join('',@DNA);
    close DNAFILE;
    # Remove whitespace
    $DNA=~ s/\s//g;
    # Remove whitespace
    $DNA=~ s/\s//g;# Line 15
    # Count number of bases
    $b=length($DNA);
    print "\nNumber of bases: $b.\nDoes the value tally with GenBank record? If yes,continue.";
    # Count number of each base and nonbase
    $A=0;$T=0;$G=0;$C=0;$e=0; # Line 20
    while($DNA=~ /A/ig){$A++}
    while($DNA=~ /T/ig){$T++}
    while($DNA=~ /G/ig){$G++}
    while($DNA=~ /C/ig){$C++}
    while($DNA=~ /[^ATGC]/ig){$e++} # Line 25
    print "\nA=$A; T=$T; G=$G; C=$C; Errors(N)=$e.\n.";
}
exit;

I have given the cmd output below.

Microsoft Windows XP Version 5.1.2600

(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\user>cd d*

C:\Documents and Settings\user\Desktop>m.pl

Please type the filename of the DNA sequence data: manjur.txt

Out of memory!

C:\Documents and Settings\user\Desktop>

I am ever grateful to perl monks for their quick reply with suggestions.

Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by Anonymous Monk on Jan 30, 2012 at 06:47 UTC
Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by rovf (Priest) on Jan 30, 2012 at 08:57 UTC

    In your original program, you slurped the whole file into memory using

    @DNA= <DNAFILE>
    and then used the array @DNA for processing the data. Of course it would be better to write your program in a way that it reads only one line at a time, but if the logic of your application already assumes that you have the data in an array, and your program is already working well, trying Tie::File doesn't cost much, because you don't have to change a lot. Instead of opening and reading the file, you would use
    tie @DNA, 'Tie::File', "your_filename_goes_here", OPTIONS_GO_HERE or die "Can not tie file ($!)";
    In any case, have a look at the mode option.
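
As a rough illustration of the tie call described above, here is a minimal sketch that opens the file read-only through the mode option and walks it one record at a time. The helper name count_bases_tied and the way the counts are returned are assumptions made for this example, not something from the thread. Note also that Tie::File still keeps some bookkeeping for every line it has seen, so it is lighter than slurping the file but not free.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper (name is an assumption): count bases in $file via Tie::File.
# Returns a hash reference of A/T/G/C counts and the number of other characters.
sub count_bases_tied {
    my ($file) = @_;

    # mode => O_RDONLY opens the file read-only, so it is never created or changed.
    tie my @dna, 'Tie::File', $file, mode => O_RDONLY
        or die "Cannot tie file '$file': $!";

    my %count = map { $_ => 0 } qw(A T G C);
    my $other = 0;

    # Fetch records by index; each $dna[$i] is one line with its line ending removed.
    for my $i (0 .. $#dna) {
        for my $char (split //, uc $dna[$i]) {
            next if $char =~ /\s/;               # ignore stray whitespace
            if (exists $count{$char}) { $count{$char}++ }
            else                      { $other++ }
        }
    }

    untie @dna;
    return (\%count, $other);
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    my ($count, $other) = count_bases_tied($ARGV[0]);
    print join(', ', map { "$_=$count->{$_}" } qw(A T G C)), "; Errors(N)=$other\n";
}
```

To keep the original prompt-driven style, the filename could instead be read with chomp(my $file = <STDIN>); and passed to the helper in the same way.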

    -- 
    Ronald Fischer <ynnor@mm.st>
Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by GrandFather (Cardinal) on Jan 30, 2012 at 10:32 UTC

    There is no need at all to use Tie anything for this task. As with many, many problems that are attacked using Perl, the key is to use a hash. There are a bunch of newbie foibles in your code that are worth cleaning up sooner rather than later. I haven't described the issues explicitly, but have addressed them in the following code. Try using the code as a starting point, and please ask about anything you don't understand:

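    #!/usr/bin/perl
    use strict;
    use warnings;

    if (! @ARGV) {
        print "Usage: > basecount.pl <bases file>\n";
        exit;
    }

    open my $dnaIn, '<', $ARGV[0] or die "Can't open bases file $ARGV[0]: $!\n";

    my %counts;
    my @baseList = qw(A T G C);

    # Read one line at a time so the whole file is never held in memory
    while (defined (my $line = <$dnaIn>)) {
        chomp $line;
        # Tally every non-whitespace character, upper-cased so 'a' counts as 'A'
        ++$counts{$_} for grep {/\S/} split '', uc $line;
    }

    my $bases = 0;
    my $errors = 0;

    # Bases that never occurred have no hash entry yet, so treat them as 0
    $counts{$_} //= 0 for @baseList;
    $bases += $_ for @counts{@baseList};
    # Anything that is not A, T, G or C counts as an error
    $errors += $_ for map {$counts{$_}} grep {! /[ATGC]/} keys %counts;

    print "Total bases: $bases\n";
    print join (', ', map {"$_: $counts{$_}"} @baseList), "\n";
    print "Errors: $errors\n" if $errors;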

    Note that the code is untested and in any case I may not have understood what you are counting.
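
A possible variation on the line-by-line approach above, sketched here under assumptions: it swaps the per-character hash tally for Perl's tr/// operator, which returns how many characters it matched and tends to be faster on very large files. The helper name count_bases_fh is hypothetical, and the case-insensitive counting mirrors the /i flags in the original question rather than anything stated in this reply.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): count bases from an open
# filehandle, reading one line at a time so memory use stays small.
sub count_bases_fh {
    my ($fh) = @_;
    my %count = (A => 0, T => 0, G => 0, C => 0);
    my $other = 0;

    while (my $line = <$fh>) {
        $line =~ s/\s+//g;    # drop the newline and any other whitespace

        # tr/// with an empty replacement only counts; it does not modify $line.
        # Its character lists are fixed at compile time, hence one call per base.
        $count{A} += ($line =~ tr/Aa//);
        $count{T} += ($line =~ tr/Tt//);
        $count{G} += ($line =~ tr/Gg//);
        $count{C} += ($line =~ tr/Cc//);

        # The /c flag complements the list: count everything that is not a base.
        $other += ($line =~ tr/ACGTacgt//c);
    }
    return (\%count, $other);
}

# Only runs when a filename is given on the command line.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Can't open bases file $ARGV[0]: $!\n";
    my ($count, $other) = count_bases_fh($in);
    close $in;
    my $total = $count->{A} + $count->{T} + $count->{G} + $count->{C};
    print "Total bases: $total\n";
    print join(', ', map { "$_: $count->{$_}" } qw(A T G C)), "\n";
    print "Errors: $other\n" if $other;
}
```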

    True laziness is hard work
