in reply to Re^4: statistics of a large text
in thread statistics of a large text
Don't worry too much about micro-optimization. The key is to take advantage of the fact that, in the sorted intermediate file, all occurrences of a given n-gram are bunched together, so you never have to track more than one group at a time. I would do it something like this (untested):
This assumes that you're going to run it as reduce_step.pl intermediate_file > final_file.

    #!/usr/bin/perl -w
    use strict;

    my $last_n_gram = "";
    my @line_numbers;

    while (<>) {
        # Each input line looks like "some n-gram: 123"
        my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/);
        if ($n_gram ne $last_n_gram) {
            # Flush the previous group before starting a new one.
            if (@line_numbers) {
                @line_numbers = sort { $a <=> $b } @line_numbers;
                print "$last_n_gram: @line_numbers\n";
            }
            $last_n_gram  = $n_gram;
            @line_numbers = ();
        }
        push @line_numbers, $line_number;
    }

    # Don't forget the final group.
    if (@line_numbers) {
        @line_numbers = sort { $a <=> $b } @line_numbers;
        print "$last_n_gram: @line_numbers\n";
    }

(Note: the assignment to $last_n_gram has to happen on every n-gram change, not only when there are line numbers queued; otherwise the first group is printed with an empty key.)
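For concreteness, here is a hypothetical intermediate file (one "n-gram: line number" pair per line, already sorted by n-gram) and the output the reduce step above would produce from it. The n-grams and line numbers are made up for illustration:

    # intermediate_file (input, sorted on the n-gram field)
    of the: 7
    of the: 3
    of the: 12
    the cat: 5
    the cat: 2

    # final_file (output: one line per n-gram, line numbers merged and sorted)
    of the: 3 7 12
    the cat: 2 5

Because the input is grouped, the script only ever holds one n-gram's line numbers in memory, which is what makes this workable on a large text.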
Replies are listed 'Best First'.
Re^6: statistics of a large text
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 11:13 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 12:44 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 13:40 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 14:13 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 14:21 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 15:14 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 15:33 UTC
  by BrowserUk (Patriarch) on Feb 10, 2011 at 15:54 UTC
  by perl_lover_always (Acolyte) on Feb 10, 2011 at 16:10 UTC
  by marto (Cardinal) on Feb 10, 2011 at 14:32 UTC
In Section
Seekers of Perl Wisdom