Don't worry too much about micro-optimization. The key is to take advantage of the fact that all the entries for a given n-gram are bunched together, so you only ever have to keep one n-gram's line numbers in memory at a time. I would do it something like this (untested):
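#! /usr/bin/perl -w
use strict;

my $last_n_gram = "";
my @line_numbers;

while (<>) {
    # Each input line is expected to look like "n-gram: line_number".
    my ($n_gram, $line_number) = ($_ =~ /(.*): (\d+)$/);
    next unless defined $line_number;    # skip lines that don't match

    if ($n_gram ne $last_n_gram) {
        # A new n-gram starts here, so flush the previous group, if any.
        if (@line_numbers) {
            @line_numbers = sort {$a <=> $b} @line_numbers;
            print "$last_n_gram: @line_numbers\n";
        }
        # Always remember the current n-gram, including on the first line.
        $last_n_gram = $n_gram;
        @line_numbers = ();
    }
    push @line_numbers, $line_number;
}

# Flush the final group (nothing to print if the input was empty).
if (@line_numbers) {
    @line_numbers = sort {$a <=> $b} @line_numbers;
    print "$last_n_gram: @line_numbers\n";
}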
This assumes that the lines for each n-gram are adjacent in the intermediate file, and that you're going to run it as
reduce_step.pl intermediate_file > final_file