Re^5: statistics of a large text

by tilly (Archbishop)
on Jan 27, 2011 at 15:05 UTC (#884570)


in reply to Re^4: statistics of a large text
in thread statistics of a large text

Don't worry too much about micro-optimization. The key is to take advantage of the fact that all the lines for a given n-gram are bunched together, so you only have to track one n-gram's line numbers at a time. I would do it something like this (untested):

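    #! /usr/bin/perl -w
    use strict;

    my $last_n_gram = "";
    my @line_numbers;
    while (<>) {
        # Each input line looks like "n-gram: line_number"; skip anything else.
        my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/) or next;
        if ($n_gram ne $last_n_gram) {
            # A new n-gram starts: flush the previous group, if there is one.
            if (@line_numbers) {
                @line_numbers = sort {$a <=> $b} @line_numbers;
                print "$last_n_gram: @line_numbers\n";
            }
            $last_n_gram = $n_gram;
            @line_numbers = ();
        }
        push @line_numbers, $line_number;
    }
    # Flush the final group (nothing to print on empty input).
    if (@line_numbers) {
        @line_numbers = sort {$a <=> $b} @line_numbers;
        print "$last_n_gram: @line_numbers\n";
    }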
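This assumes that you're going to run reduce_step.pl intermediate_file > final_file, and that the intermediate file is already sorted so that identical n-grams sit on adjacent lines.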


Re^6: statistics of a large text
by perl_lover_always (Acolyte) on Feb 10, 2011 at 11:13 UTC
    Since you are more expert in memory usage and related issues, I have a question. Why, when I load my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a lot of RAM (50 GB)?

      What does perl -V output on your system?


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        This is perl, v5.8.8 built for x86_64-linux-thread-multi
      Why, when I load my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a lot of RAM (50 GB)?

      Assuming that your OS and Perl allow you access to the full 50 GB, you should not be running out of memory.

      On a 64-bit system, a hash of arrays (HoA) with 7 million keys and an average of 10 numbers per array requires ~3.5 GB. For two, reckon on 10 GB max.
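
      One way to sanity-check an estimate like this on your own machine is to build a small hash of arrays and watch how much the process grows. The following is only a rough sketch under stated assumptions: it is Linux-only (it reads VmRSS from /proc/self/status), the names rss_kb and build_hoa are made up for illustration, and the 100,000-key sample with 10 numbers per key is an arbitrary choice. Real data with longer keys or string values will come out larger.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Current resident set size in kB, read from /proc (Linux only).
# Returns nothing if the file or the VmRSS line is unavailable.
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while (<$fh>) {
        return $1 if /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

# Build a hash of arrays with $keys keys and $per_key numbers per key.
sub build_hoa {
    my ($keys, $per_key) = @_;
    my %h;
    for my $i (1 .. $keys) {
        push @{ $h{"key$i"} }, map { $_ * $i } 1 .. $per_key;
    }
    return \%h;
}

my $before = rss_kb();
my $hoa    = build_hoa(100_000, 10);   # small sample, assumed representative
my $after  = rss_kb();

if (defined $before && defined $after) {
    # Extrapolate the measured growth per key to 7 million keys.
    my $bytes_per_key = ($after - $before) * 1024 / scalar(keys %$hoa);
    printf "~%.0f bytes per key; ~%.1f GB for 7 million keys\n",
        $bytes_per_key, $bytes_per_key * 7e6 / 2**30;
}
```

      If the extrapolated figure is far below your available RAM and the real script still dies, that points at a limit outside Perl or at the script holding more than the two hashes.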

      I'm not aware of any restrictions or limits on the memory a 64-bit Perl can address, which leaves your OS. Linux can apply per-process (and per-user?) limits to memory and CPU usage. I don't know what the commands are for discovering this information, but maybe that is somewhere you should be looking.
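
      For reference, on Linux the per-process limits can usually be listed with the shell builtin ulimit -a, and newer kernels also expose them in /proc/self/limits. Here is a minimal sketch of checking them from inside Perl, so you see the limits the script itself runs under. It assumes Linux; parse_limits and report are hypothetical helper names, and the three limits printed are just the ones most likely to cap a large hash.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse the text of /proc/<pid>/limits into { name => [soft, hard] }.
# Columns in that file are separated by runs of two or more spaces.
sub parse_limits {
    my ($text) = @_;
    my %limits;
    for my $line (split /\n/, $text) {
        next if $line =~ /^Limit\s/;    # skip the header row
        my ($name, $soft, $hard) = split /\s{2,}/, $line;
        next unless defined $hard;
        $limits{$name} = [$soft, $hard];
    }
    return \%limits;
}

# Print the memory-related limits of the current process.
sub report {
    if (open my $fh, '<', '/proc/self/limits') {
        local $/;                        # slurp the whole file
        my $limits = parse_limits(<$fh>);
        for my $name ('Max address space', 'Max data size', 'Max resident set') {
            my $pair = $limits->{$name} or next;
            print "$name: soft=$pair->[0] hard=$pair->[1]\n";
        }
    }
    else {
        # Older kernels lack this file; ulimit is a shell builtin,
        # so it has to be run through sh.
        print qx{sh -c 'ulimit -a'};
    }
}

report() unless caller;
```

      Anything other than "unlimited" for the address space or data size is worth comparing against the size at which the script dies.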


        I have no idea, since I can access the whole memory (all 50 GB). Do you think it has something to do with my code?
