PerlMonks  

Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007

by supriyoch_2008 (Scribe)
on Jan 30, 2012 at 05:56 UTC (#950702=perlquestion)
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I am a beginner in Perl programming. I have written a Perl program that counts the number of bases in a DNA molecule from a text file, run from the command prompt on Windows XP. The program works well with small text files containing DNA sequence data and gives correct results.

But when I tried to count the bases in a 299 MB text file, the program printed “Out of memory!” in cmd after nearly 2 minutes. One Perl monk suggested that I use Tie::File to solve this problem. I looked up the syntax of Tie::File on the internet but could not work out how to use it in my program, where the filename is read with the <STDIN> input operator into a scalar variable, as in $DNAfilename=<STDIN>;

The Perl program is given below. Could any Perl monk help me sort out this problem, either with Tie::File or by another method, so that I can analyze a 299 MB text file? My laptop has 2 GB of RAM and runs Active Perl 5.10.1 Build 1007.

Can I use something like  %mem=300 MW; in Perl? Code of this type is used by some theoretical and physical chemists in specific C-based programs for “Quantitative Structure Analysis and Report (QSAR)” of biomolecules to solve memory problems.

#!usr/bin/perl
print "\n\nPlease type the filename of the DNA sequence data: ";
$DNAfilename=<STDIN>;
chomp $DNAfilename;
unless ( open(DNAFILE, $DNAfilename) ) {
    print "Cannot open file \"$DNAfilename\"\n\n";
    exit;
}
while(@DNA= <DNAFILE>) {
    $DNA=join('',@DNA);
    close DNAFILE;
    # Remove whitespace
    $DNA=~ s/\s//g;
    # Remove whitespace
    $DNA=~ s/\s//g;# Line 15
    # Count number of bases
    $b=length($DNA);
    print "\nNumber of bases: $b.\nDoes the value tally with GenBank record? If yes,continue.";
    # Count number of each base and nonbase
    $A=0;$T=0;$G=0;$C=0;$e=0; # Line 20
    while($DNA=~ /A/ig){$A++}
    while($DNA=~ /T/ig){$T++}
    while($DNA=~ /G/ig){$G++}
    while($DNA=~ /C/ig){$C++}
    while($DNA=~ /[^ATGC]/ig){$e++} # Line 25
    print "\nA=$A; T=$T; G=$G; C=$C; Errors(N)=$e.\n.";
}
exit;

I have given the cmd output below.

Microsoft Windows XP Version 5.1.2600

(C) Copyright 1985-2001 Microsoft Corp.

C:\Documents and Settings\user>cd d*

C:\Documents and Settings\user\Desktop>m.pl

Please type the filename of the DNA sequence data: manjur.txt

Out of memory!

C:\Documents and Settings\user\Desktop>

I am ever grateful to perl monks for their quick reply with suggestions.

Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by Anonymous Monk on Jan 30, 2012 at 06:47 UTC
Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by rovf (Priest) on Jan 30, 2012 at 08:57 UTC

    In your original program, you slurped the whole file into memory using

    @DNA= <DNAFILE>
    and then used the array @DNA for processing the data. Of course it would be better to write your program in a way that it reads only one line at a time, but if the logic of your application already assumes that you have the data in an array, and your program is already working well, trying Tie::File doesn't cost much, because you don't have to change a lot. Instead of opening and reading the file, you would use
    tie @DNA, 'Tie::File', "your_filename_goes_here", OPTIONS_GO_HERE or die "Can not tie file ($!)";
    In any case, have a look at the mode option.

    -- 
    Ronald Fischer <ynnor@mm.st>
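
For readers who want to see the Tie::File suggestion spelled out, here is a minimal sketch of a read-only pass over the file. The sub name count_bases_tied, the 10 MB memory cap and the "other" bucket are assumptions made for illustration; they are not part of the reply above. Lines are fetched on demand instead of being slurped, so the join('', @DNA) step from the original program has to go, since joining the tied array would pull the whole file into memory again.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: count bases through a read-only Tie::File array.
# Returns a hash reference with keys A, T, G, C and "other".
sub count_bases_tied {
    my ($filename) = @_;

    # mode => O_RDONLY opens the file without creating or modifying it.
    # memory => ... caps Tie::File's record cache in bytes (arbitrary value).
    tie my @dna, 'Tie::File', $filename,
        mode   => O_RDONLY,
        memory => 10_000_000
        or die "Cannot tie file '$filename': $!\n";

    my %count = (A => 0, T => 0, G => 0, C => 0, other => 0);
    for my $i (0 .. $#dna) {
        my $line = $dna[$i];    # fetched on demand; line ending already stripped

        # tr/// with an empty replacement counts characters without changing them.
        my %in_line = (
            A => ($line =~ tr/Aa//),
            T => ($line =~ tr/Tt//),
            G => ($line =~ tr/Gg//),
            C => ($line =~ tr/Cc//),
        );
        my $non_space = ($line =~ tr/ \t\r\n\f//c);    # everything except whitespace

        my $known = 0;
        for my $base (keys %in_line) {
            $count{$base} += $in_line{$base};
            $known += $in_line{$base};
        }
        $count{other} += $non_space - $known;
    }

    untie @dna;
    return \%count;
}

# Run as: perl script.pl <bases file>
if (@ARGV) {
    my $count = count_bases_tied($ARGV[0]);
    print join(', ', map { "$_: $count->{$_}" } qw(A T G C other)), "\n";
}
```

One caveat on the design: to know the array length, Tie::File has to scan the file and remember where every line starts, so for a file of this size a plain line-by-line read with a filehandle is likely to be lighter still.
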
Re: Seeking help for using Tie::File in my perl program for counting bases using Active Perl 5.10.1 Build 1007
by GrandFather (Sage) on Jan 30, 2012 at 10:32 UTC

    There is no need at all to use Tie anything for this task. As with many, many problems that are attacked using Perl, the key is to use a hash. There are a bunch of newbie foibles in your code that are worth cleaning up sooner rather than later. I haven't described the issues explicitly, but have addressed them in the following code. Try using the code as a starting point, and please ask about anything you don't understand:

    #!/usr/bin/perl
    use strict;
    use warnings;

    if (! @ARGV) {
        print <<HELP;
    Usage:
    > basecount.pl <bases file>
    HELP
        exit;
    }

    open my $dnaIn, '<', $ARGV[0] or die "Can't open bases file $ARGV[0]: $!\n";
    my %counts;
    my @baseList = qw(A T G C);

    while (defined (my $line = <$dnaIn>)) {
        chomp $line;
        ++$counts{$_} for grep {/\S/} split '', $line;
    }

    my $bases;
    my $errors;

    $bases += $_ for @counts{@baseList};
    $errors += $_ for map {$counts{$_}} grep {! /[ATGC]/} keys %counts;

    print "Total bases: $bases\n";
    print join (', ', map {"$_: $counts{$_}"} @baseList), "\n";
    print "Errors: $errors\n" if $errors;

    Note that the code is untested and in any case I may not have understood what you are counting.

    True laziness is hard work
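
The original program matched bases case-insensitively (the /i flag), while the hash-based code above counts a lowercase "a" as an error. As a hedged variation on the same one-line-at-a-time idea, the sketch below folds each line to uppercase before tallying and starts every base at zero, so a base that never occurs prints as 0 instead of raising an "uninitialized" warning. The sub names tally_characters and summarise are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @base_list = qw(A T G C);

# Read a filehandle one line at a time and tally every non-whitespace
# character, folding lowercase to uppercase first.
sub tally_characters {
    my ($fh) = @_;
    my %counts = map { $_ => 0 } @base_list;    # absent bases stay at 0
    while (defined(my $line = <$fh>)) {
        $line =~ s/\s+//g;                      # drop the newline and other whitespace
        ++$counts{$_} for split //, uc $line;
    }
    return \%counts;
}

# Split the tallies into recognised bases and everything else.
sub summarise {
    my ($counts) = @_;
    my ($bases, $errors) = (0, 0);
    for my $char (keys %$counts) {
        if ($char =~ /^[ATGC]$/) {
            $bases += $counts->{$char};
        }
        else {
            $errors += $counts->{$char};
        }
    }
    return ($bases, $errors);
}

# Run as: perl script.pl <bases file>
if (@ARGV) {
    open my $dna_in, '<', $ARGV[0]
        or die "Can't open bases file $ARGV[0]: $!\n";
    my $counts = tally_characters($dna_in);
    close $dna_in;

    my ($bases, $errors) = summarise($counts);
    print "Total bases: $bases\n";
    print join(', ', map { "$_: $counts->{$_}" } @base_list), "\n";
    print "Errors: $errors\n" if $errors;
}
```

Only one line is held in memory at a time, so the size of the file no longer matters much. The filename could just as well be read with chomp(my $file = <STDIN>) as in the original post; it is the reading loop, not the way the name is obtained, that decides the memory use.
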
