Don't worry too much about micro-optimization. The key is to take advantage of the fact that all the lines for a given n-gram are bunched together, so you only have to track one n-gram at a time. I would do it something like this (untested):
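#! /usr/bin/perl -w
use strict;

my $last_n_gram = "";
my @line_numbers;

while (<>) {
    my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/);
    next unless defined $n_gram;    # skip lines that don't look like "n-gram: number"

    if ($n_gram ne $last_n_gram) {
        # A new n-gram has started: flush the previous group, if there is one.
        if (@line_numbers) {
            @line_numbers = sort {$a <=> $b} @line_numbers;
            print "$last_n_gram: @line_numbers\n";
        }
        # Always remember the current n-gram, including on the very first line.
        $last_n_gram = $n_gram;
        @line_numbers = ();
    }
    push @line_numbers, $line_number;
}

# Flush the final group.
if (@line_numbers) {
    @line_numbers = sort {$a <=> $b} @line_numbers;
    print "$last_n_gram: @line_numbers\n";
}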
This assumes that you're going to run reduce_step.pl intermediate_file > final_file.
Since you are more of an expert in memory usage and related issues, I have a question. When I load my 5 GB file, which has about 7 million records in two columns, and build two hashes from two different files of the same format and size, why do I run out of memory even with a large amount of RAM (50 GB)?
"Why when I store my 5 GB of file which has about 7m records of two columns, and I make two hashes from two different files in the same format and size, even with a large RAM (50 GB) I run out of memory?"
Assuming that your OS and Perl allow you access to the full 50 GB, you should not be running out of memory.
On a 64-bit system, an HoA (hash of arrays) with 7 million keys and an average of 10 numbers per array requires ~3.5 GB. For two, reckon on 10 GB max.
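To check a figure like this against your own data, one option is to measure a small sample and scale it up. The sketch below assumes the CPAN module Devel::Size is installed (it is not part of core Perl) and uses made-up keys and numbers. build_sample and extrapolate_bytes are hypothetical helper names, and a linear extrapolation is only a rough guide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a sample hash of arrays: $keys keys with $per_key integers each.
# The key names and values are invented for illustration.
sub build_sample {
    my ($keys, $per_key) = @_;
    my %hoa;
    for my $k (1 .. $keys) {
        $hoa{"ngram_$k"} = [ map { $k * 100 + $_ } 1 .. $per_key ];
    }
    return \%hoa;
}

# Scale a measured sample size linearly up to the target key count.
sub extrapolate_bytes {
    my ($sample_bytes, $sample_keys, $target_keys) = @_;
    return $sample_bytes / $sample_keys * $target_keys;
}

my $sample_keys = 100_000;
my $sample      = build_sample($sample_keys, 10);

# Devel::Size is loaded at run time so the script still runs without it.
if (eval { require Devel::Size; 1 }) {
    my $bytes = Devel::Size::total_size($sample);
    my $est   = extrapolate_bytes($bytes, $sample_keys, 7_000_000);
    printf "sample: %.1f MB, estimate for 7M keys: %.2f GB\n",
        $bytes / 2**20, $est / 2**30;
}
else {
    print "Devel::Size is not installed; install it from CPAN to measure.\n";
}
```

If the estimate for your real key and value shapes comes out far above a few GB per hash, the data structure itself is the problem. If it does not, look at limits outside Perl.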
I'm not aware of any restrictions or limits on the memory a 64-bit Perl can address, which leaves your OS. Linux can apply per-process (and per-user?) limits to memory and CPU usage. I don't know what the commands are for discovering this information, but maybe that is somewhere you should be looking.
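For what it's worth, on Linux the limits applied to a running process are listed in /proc/<pid>/limits, and in bash `ulimit -a` prints the limits for the current shell. Here is a minimal sketch, assuming a Linux /proc filesystem, that has the Perl process report its own memory-related limits. parse_limits is a hypothetical helper name, and the three limit names shown are the ones I would look at first.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse the text of /proc/<pid>/limits into { name => [soft, hard] }.
sub parse_limits {
    my ($text) = @_;
    my %limits;
    for my $line (split /\n/, $text) {
        next if $line =~ /^Limit\s/;    # skip the header row
        # The name is followed by two or more spaces, then soft and hard values.
        if ($line =~ /^(.+?)\s{2,}(\S+)\s+(\S+)/) {
            $limits{$1} = [$2, $3];
        }
    }
    return \%limits;
}

my $path = "/proc/self/limits";
if (open my $fh, '<', $path) {
    local $/;                           # slurp the whole file
    my $limits = parse_limits(<$fh>);
    close $fh;
    for my $name ("Max address space", "Max data size", "Max resident set") {
        my $pair = $limits->{$name} or next;
        print "$name: soft=$pair->[0] hard=$pair->[1]\n";
    }
}
else {
    print "$path is not available (not Linux?); try 'ulimit -a' in your shell.\n";
}
```

A value of "unlimited" for all three suggests the limit is not coming from per-process resource limits, so the next place to look would be whatever else constrains the job (a batch scheduler or container, for example).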
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.