PerlMonks  

Request to detect the mistake in a perl script for finding inter-substring distance from a large text file

by supriyoch_2008 (Scribe)
on Jan 24, 2012 at 09:39 UTC ( #949618=perlquestion )
supriyoch_2008 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

I am a beginner in Perl programming. I have written a Perl script that reads a small text file and gives correct results for the inter-substring distance when run from cmd on Windows XP. But when I try to analyze a large text file with 219475005 letters, cmd reports “Out of memory!”. The program counts the number of each letter in the file correctly within 2 minutes, but it fails to find the inter-substring distance. I think this could be due to incorrect reading of the file.

So I have given the initial part of the script and the results of cmd screen below. I am seeking your suggestions to rectify the mistake in the script for analyzing a large file.

Furthermore, I need the syntax, in the initial part of the script, for reading the large input file into an array variable such as my @lines, so that I can assign this array to a scalar variable such as my $string ="@lines"; for use in a later part of the script.

#!/usr/bin/perl -w
print "\n\nPlease type the filename: ";
$DNAfilename = <STDIN>;
chomp $DNAfilename;
# open the large file
unless ( open(DNAFILE, $DNAfilename) ) {
    print "Cannot open file \"$DNAfilename\"\n\n";
    exit;
}
my @lines = <DNAFILE>;
while (<DNAFILE>) {
    say $_;
}
close DNAFILE;
$DNA = join( '', @lines);
# Remove whitespace
$DNA=~ s/\s//g;
# Count number of bases
$b=length($DNA);
print "\nNumber of bases: $b.";
# Count number of each base and nonbase
$A=0;$T=0;$G=0;$C=0;$e=0;
while($DNA=~ /A/ig){$A++}
while($DNA=~ /T/ig){$T++}
while($DNA=~ /G/ig){$G++}
while($DNA=~ /C/ig){$C++}
while($DNA=~ /[^ATGC]/ig){$e++}
. . . .

Command Prompt Results:

C:\Documents and Settings\user\Desktop>m3.pl

Please type the filename of the DNA sequence data: chr1.txt

Number of bases: 219475005.

A=63473407; T=63582431; G=45425056; C=45435903; Errors(N)=1558208.

Enter a motif to count nt between two such motifs: GAATTCCT

I found the motif!

Out of memory!

C:\Documents and Settings\user\Desktop>

Thanks to Perl Monks for their quick reply in solving perl problems.

Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file
by rovf (Priest) on Jan 24, 2012 at 10:02 UTC
    my @lines = <DNAFILE>; while (<DNAFILE>) { say $_; }
    This piece of code doesn't make sense. First, you read the *whole* file into memory (storing it in @lines), and then you try to read another line (your while loop). That is, of course, not possible: the filehandle is already at end-of-file, so there is nothing left to read. Your loop won't be executed; you can remove it without harm.
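To illustrate the point about the exhausted filehandle, here is a small sketch. It uses an in-memory filehandle and made-up sample data rather than the original file, so the variable names and contents are assumptions for demonstration only.

```perl
use strict;
use warnings;

# Open an in-memory filehandle over a small sample string (illustrative data).
my $sample = "ACGT\nTTGA\nCCAG\n";
open(my $fh, '<', \$sample) or die "Cannot open in-memory file: $!";

# List context: reads every remaining line and leaves the handle at end-of-file.
my @lines = <$fh>;

# The handle is already exhausted, so this loop body never runs.
my $extra = 0;
while (my $line = <$fh>) {
    $extra++;
}
close $fh;

printf "lines read in list context: %d\n", scalar @lines;
printf "lines seen by the while loop: %d\n", $extra;
```

The second count stays at zero, which is why the while loop in the original script can be deleted without changing its behaviour.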

    But the main problem is that you read the whole file into memory and process it from there. No wonder that your memory gets exhausted sooner or later (try to pour a whole bottle of beer into a coffee cup; unless the cup is really huge, you will spill some beer).
    Maybe Tie::File will help you as a start. It allows you to treat the whole file as an array of lines, without slurping it into memory. Be aware that the runtime of your application may well increase.
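A minimal sketch of the Tie::File idea, assuming the sequence file is split into many lines (if the whole sequence sits on one enormous line, tying the file will not reduce memory use much). The temporary file and its contents are made up so the example is self-contained; with real data you would pass your own filename to tie.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;
use File::Temp qw(tempfile);

# Write a tiny sample file so the sketch can run on its own (illustrative data).
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "ACGTAC\nGGTTAA\nNNACGT\n";
close $tmp_fh;

# Tie the file to an array: each element is one line, fetched from disk on demand.
tie(my @records, 'Tie::File', $tmp_name, mode => O_RDONLY)
    or die "Cannot tie $tmp_name: $!";

my %count = (A => 0, C => 0, G => 0, T => 0, other => 0);
for my $i (0 .. $#records) {
    my $line = $records[$i];    # line ending already removed (autochomp default)
    $count{A} += ($line =~ tr/Aa//);
    $count{C} += ($line =~ tr/Cc//);
    $count{G} += ($line =~ tr/Gg//);
    $count{T} += ($line =~ tr/Tt//);
    $count{other} += ($line =~ tr/ACGTacgt//c);   # anything that is not a base
}
untie @records;

print join('; ', map { "$_=$count{$_}" } qw(A C G T other)), "\n";
```

For a purely sequential pass like this one, a plain while (<$fh>) loop would do the same job with less overhead; Tie::File is mainly of interest when the later part of the script needs random access to lines by index.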

    -- 
    Ronald Fischer <ynnor@mm.st>
Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file
by johngg (Abbot) on Jan 24, 2012 at 10:25 UTC

    A better way to count your bases would be to use tr.

    knoppix@Microknoppix:~$ perl -Mstrict -wE '
    > my $dna = q{CCATGNGTTATGNGTTACACGTNGTNTACG};
    > my $b = length $dna;
    > my $A = $dna =~ tr{A}{};
    > my $C = $dna =~ tr{C}{};
    > my $G = $dna =~ tr{G}{};
    > my $T = $dna =~ tr{T}{};
    > my $e = $b - ( $A + $C + $G + $T );
    > say qq{Number of bases - $b};
    > say qq{A = $A; C = $C; G = $G; T = $T; err = $e};'
    Number of bases - 30
    A = 5; C = 5; G = 7; T = 9; err = 4
    knoppix@Microknoppix:~$

    I hope this is helpful.

    Cheers,

    JohnGG

Re: Request to detect the mistake in a perl script for finding inter-substring distance from a large text file
by lune (Monk) on Jan 24, 2012 at 12:09 UTC
    As far as I can see there is no need to
    a) read in the whole file at once
    b) paste the lines together

    Your count won't change if you do the counting line by line, which would solve your memory problem for that part. So it would be worth looking at the part of your program that needs the whole file as a single string, to see whether it could be changed to work line by line too. If not, there is the earlier suggestion to use Tie::File.

    Then, you can search for all valid characters at once and use a hash to collect and count them.

    This would be a possible solution:

    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use diagnostics;
    use Data::Dumper;

    my $filename = "dna.txt";
    open(my $fh, "<", $filename) || die "could not open $filename: $!\n";

    my %bases;
    my $cnt_errors = 0;
    while (<$fh>) {
        # strip spaces
        s/\s+//g;
        # collect results
        my @results = ($_ =~ /[ACGT]/ig);
        $bases{$_}++ for @results;
        $cnt_errors += ( length($_) - scalar @results );
    }
    close $fh;
    # pass a reference so the hash is dumped as one structure
    print Dumper(\%bases);
    print "Errors: $cnt_errors\n";
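The part of the original script that computes the inter-substring distance is not shown, so the following is only a sketch of how that step might be done line by line as well. The helper names motif_positions and gaps_between are hypothetical, the sample data is made up, and the "distance" is assumed to mean the number of bases between the end of one motif occurrence and the start of the next (it comes out negative when occurrences overlap). The idea is to keep only the last length(motif) - 1 characters between reads, so that a motif split across two lines is still found.

```perl
use strict;
use warnings;

# Hypothetical helper: 0-based start offsets of every occurrence of $motif in
# the sequence read from $fh, ignoring whitespace and letter case, without
# holding the whole sequence in memory.
sub motif_positions {
    my ($fh, $motif) = @_;
    die "motif must not be empty\n" if !defined $motif || $motif eq '';
    $motif = uc $motif;
    my $keep   = length($motif) - 1;  # tail that could begin a split match
    my $buffer = '';                  # kept tail plus the current line
    my $offset = 0;                   # sequence offset of the buffer's first char
    my @positions;

    while (my $line = <$fh>) {
        $line =~ s/\s+//g;
        $buffer .= uc $line;
        my $from = 0;
        while ((my $hit = index($buffer, $motif, $from)) >= 0) {
            push @positions, $offset + $hit;
            $from = $hit + 1;
        }
        # Discard everything except the last $keep characters.
        my $drop = length($buffer) - $keep;
        if ($drop > 0) {
            substr($buffer, 0, $drop, '');
            $offset += $drop;
        }
    }
    return @positions;
}

# Hypothetical helper: bases between consecutive occurrences.
sub gaps_between {
    my ($motif_len, @positions) = @_;
    return map { $positions[$_] - $positions[$_ - 1] - $motif_len }
           1 .. $#positions;
}

# Demo on an in-memory "file"; the second occurrence spans the line break.
my $demo = "AAGAATTCCTGGGAATT\nCCTAAAAGAATTCCT\n";
open(my $demo_fh, '<', \$demo) or die "Cannot open in-memory file: $!";
my @found = motif_positions($demo_fh, 'GAATTCCT');
close $demo_fh;

print "positions: @found\n";
print "gaps: ", join(' ', gaps_between(length('GAATTCCT'), @found)), "\n";
```

If the real file turns out to be one very long line, reading by line will not help; setting local $/ = \65536; before the loop makes each read return a fixed-size chunk instead, and the same carry-over logic applies. For very short motifs the list of positions itself can grow large, in which case the gaps could be printed as they are found rather than collected.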
