in reply to Re^4: statistics of a large text
in thread statistics of a large text
Don't worry too much about micro-optimization. The key is to take advantage of the fact that, in the sorted intermediate file, all occurrences of a given n-gram are bunched together, so you never have to track more than one group at a time. I would do it something like this (untested):
This assumes that you're going to run it as reduce_step.pl intermediate_file > final_file.

    #!/usr/bin/perl -w
    use strict;

    my $last_n_gram = "";
    my @line_numbers;

    while (<>) {
        # Each input line looks like "some n-gram: 123"
        my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/);
        if ($n_gram ne $last_n_gram) {
            # Flush the previous group before starting a new one.
            if (@line_numbers) {
                @line_numbers = sort { $a <=> $b } @line_numbers;
                print "$last_n_gram: @line_numbers\n";
            }
            $last_n_gram  = $n_gram;
            @line_numbers = ();
        }
        push @line_numbers, $line_number;
    }

    # Don't forget the final group.
    if (@line_numbers) {
        @line_numbers = sort { $a <=> $b } @line_numbers;
        print "$last_n_gram: @line_numbers\n";
    }

(Note: the assignment to $last_n_gram has to happen on every n-gram change, not only when there are line numbers queued; otherwise the first group is printed with an empty key.)
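For concreteness, here is a hypothetical intermediate file (one "n-gram: line number" pair per line, already sorted by n-gram) and the output the reduce step above would produce from it. The n-grams and line numbers are made up for illustration:

    # intermediate_file (input, sorted on the n-gram field)
    of the: 7
    of the: 3
    of the: 12
    the cat: 5
    the cat: 2

    # final_file (output: one line per n-gram, line numbers merged and sorted)
    of the: 3 7 12
    the cat: 2 5

Because the input is grouped, the script only ever holds one n-gram's line numbers in memory, which is what makes this workable on a large text.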
Replies are listed 'Best First'.
Re^6: statistics of a large text
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 11:13 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 12:44 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 13:40 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 14:13 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 14:21 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 15:14 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 15:33 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 15:54 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 16:10 UTC
  by marto (Cardinal) on Feb 10, 2011 at 14:32 UTC
In Section
Seekers of Perl Wisdom