PerlMonks  

Re^4: statistics of a large text

by perl_lover_always (Acolyte)
on Jan 27, 2011 at 09:59 UTC (#884526)


in reply to Re^3: statistics of a large text
in thread statistics of a large text

Thanks! In the third step of your approach, how can I quickly merge each $line_number into @line_number, given that my file is now even bigger than before? Any advice on that?


Re^5: statistics of a large text
by tilly (Archbishop) on Jan 27, 2011 at 15:05 UTC
    Don't worry too much about micro-optimization. The key is to take advantage of the fact that all the lines for an n-gram are bunched together, so you don't have to track too much information. I would do something like this (untested):
    #! /usr/bin/perl -w
    use strict;

    my $last_n_gram = "";
    my @line_numbers;

    while (<>) {
        # Each input line looks like "n-gram: line_number"; skip anything else.
        my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/) or next;
        if ($n_gram ne $last_n_gram) {
            # A new n-gram starts: flush the previous group, if there is one.
            if (@line_numbers) {
                @line_numbers = sort {$a <=> $b} @line_numbers;
                print "$last_n_gram: @line_numbers\n";
            }
            $last_n_gram = $n_gram;
            @line_numbers = ();
        }
        push @line_numbers, $line_number;
    }

    # Flush the final group.
    if (@line_numbers) {
        @line_numbers = sort {$a <=> $b} @line_numbers;
        print "$last_n_gram: @line_numbers\n";
    }
    This assumes that you're going to run reduce_step.pl intermediate_file > final_file.
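To make the expected input and output concrete, here is a minimal sketch of the same grouping idea applied to a few in-memory records. The helper name reduce_lines and the sample records are made up for illustration and are not from the thread. It assumes the records are already ordered so that equal n-grams sit next to each other.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): collapse adjacent
# "n-gram: line_number" records into one record per n-gram.
# Assumes equal n-grams are adjacent in the input.
sub reduce_lines {
    my @out;
    my ($last, @nums);
    for my $line (@_) {
        # Skip records that do not look like "n-gram: number".
        my ($n_gram, $num) = $line =~ /(.*): (\d+)$/ or next;
        if (defined $last and $n_gram ne $last) {
            # The n-gram changed: emit the finished group.
            push @out, "$last: " . join(' ', sort { $a <=> $b } @nums);
            @nums = ();
        }
        $last = $n_gram;
        push @nums, $num;
    }
    # Emit the final group, if any record was seen.
    push @out, "$last: " . join(' ', sort { $a <=> $b } @nums)
        if defined $last;
    return @out;
}

# Made-up sample data.
my @sample = ("a b: 12", "a b: 3", "b c: 7");
print "$_\n" for reduce_lines(@sample);
```

Only one n-gram's line numbers are held in memory at a time, which is why the size of the whole file matters much less than the size of the largest group.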
      Since you are more expert in memory usage and related issues, I have a question. Why, when I store my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a lot of RAM (50 GB)?

        What does perl -V output on your system?


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        Why, when I store my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, do I run out of memory even with a lot of RAM (50 GB)?

        Assuming that your OS and Perl allow you full access to the full 50GB, you should not be running out of memory.

        On a 64-bit system, a HoA (hash of arrays) with 7 million keys and an average of 10 numbers per array requires ~3.5 GB. For two, reckon on 10 GB max.

        I'm not aware of any restrictions or limits on the memory a 64-bit Perl can address, which leaves your OS. Linux can apply per-process (and per-user?) limits to memory and CPU usage. I don't know what the commands are for discovering this information, but maybe that is somewhere you should be looking.
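As one possible starting point for that check, here is a minimal sketch that reads the resource limits of the running Perl process. It assumes Linux with /proc mounted, where /proc/self/limits lists soft and hard limits per resource. The helper name parse_limits is made up for illustration. From a shell, the `ulimit -a` builtin reports the same kind of limits.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse the text of a Linux /proc/<pid>/limits file
# into { limit name => [soft, hard] }. Columns are separated by runs of
# two or more whitespace characters; limit names contain single spaces.
sub parse_limits {
    my ($text) = @_;
    my %limits;
    for my $line (split /\n/, $text) {
        next if $line =~ /^Limit\s/;    # header row
        my ($name, $soft, $hard) = split /\s{2,}/, $line;
        next unless defined $hard;      # skip malformed rows
        $limits{$name} = [$soft, $hard];
    }
    return \%limits;
}

my $file = '/proc/self/limits';    # Linux-specific path (assumption)
if (open my $fh, '<', $file) {
    local $/;                       # slurp the whole file at once
    my $limits = parse_limits(<$fh>);
    for my $name ('Max address space', 'Max data size', 'Max resident set') {
        my $pair = $limits->{$name} or next;
        print "$name: soft=$pair->[0] hard=$pair->[1]\n";
    }
}
else {
    print "Cannot read $file (not Linux?): $!\n";
}
```

If "Max address space" or "Max data size" shows a number rather than "unlimited", the process may hit that cap well before the machine runs out of physical RAM.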


