In reply to: statistics of a large text
The standard way to do this type of text analysis on large bodies of text is to use MapReduce. The workflow for that looks like this:
- Take text, emit key/value pairs.
- Sort the key/value pairs by key then value.
- For each key, process the sorted values.
With a typical framework like Hadoop, you only have to write the first and third steps, which are called Map and Reduce respectively. All three steps can also be distributed across multiple machines, allowing you to scale the work across a cluster.
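To make the shape of those three steps concrete, here is a toy, fully in-memory sketch in Perl. Everything about it is illustrative: word counting stands in for whatever statistic you actually want, and the input lines are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map: emit (word, 1) key/value pairs from each line of input.
my @pairs;
for my $line ("the quick fox", "the lazy dog") {
    push @pairs, [$_, 1] for split ' ', $line;
}

# Sort: bring identical keys next to each other.
@pairs = sort { $a->[0] cmp $b->[0] } @pairs;

# Reduce: walk the sorted pairs, summing the values for each key.
# (With an in-memory hash the sort is redundant; on disk, the sort is
# exactly what lets the reduce step stream through one key at a time.)
my %count;
$count{ $_->[0] } += $_->[1] for @pairs;

print "$_: $count{$_}\n" for sort keys %count;
```

The payoff of this structure is that each step only ever streams over its input, which is what makes the disk-based version below practical.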
Your example can benefit from the same approach, even on just one machine, and even without a framework.
Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory (particularly not with how memory-hungry Perl's data structures are), so don't even try; plan on using the disk. And MapReduce uses the disk in exactly the way disks like to be used: stream data to and from them sequentially, and don't seek.
What you should do is read your original file and print all of your n-grams to a second file, with lines of the form $n_gram: $line_number.

Then call the unix sort utility on the second file to get a third file with the exact same lines, only sorted by $n_gram and then line number. (Line numbers will be sorted asciibetically, not numerically, but that doesn't matter yet.)

Now take one pass through the third file to collapse each run of identical n-grams into a single line of the form $n_gram: @line_numbers. (This file will be trivially sorted. If you care, this pass is also the place to sort your line numbers numerically before printing them.)

And now you can use the built-in module Search::Dict to quickly look up any n-gram of interest in that file. (But if you have to do any significant further processing of this data, I would recommend framing that processing as another MapReduce pass. Doing lots of lookups means that you'll be seeking to disk a lot, and disk seeks are slow.)
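Here is one way the whole pipeline might look in Perl. Treat it as a sketch: the file names (input.txt, ngrams.txt, sorted.txt, collapsed.txt), the choice of trigrams, and the tiny seeded input are all hypothetical placeholders so the example runs end to end; the structure is what matters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Search::Dict;

$ENV{LC_ALL} = 'C';    # make sort(1) collate asciibetically

my $n = 3;             # trigrams, for the sake of the example

# A tiny stand-in input so the sketch runs end to end.
open my $seed, '>', 'input.txt' or die $!;
print {$seed} "the quick brown fox\nthe quick brown dog\n";
close $seed;

# Step 1 (Map): stream the input, emit "$n_gram\t$line_number" pairs.
open my $in,  '<', 'input.txt'  or die $!;
open my $out, '>', 'ngrams.txt' or die $!;
while (my $line = <$in>) {
    chomp $line;
    my @words = split ' ', $line;
    for my $i (0 .. @words - $n) {
        print {$out} join(' ', @words[$i .. $i + $n - 1]), "\t$.\n";
    }
}
close $in;
close $out;

# Step 2 (Sort): let the external sort do the heavy lifting on disk.
system('sort', '-o', 'sorted.txt', 'ngrams.txt') == 0 or die "sort failed";

# Step 3 (Reduce): collapse runs of the same n-gram into one line each,
# fixing up the line-number order numerically while we're at it.
open my $sorted,    '<', 'sorted.txt'    or die $!;
open my $collapsed, '>', 'collapsed.txt' or die $!;
my ($prev, @nums);
while (my $pair = <$sorted>) {
    chomp $pair;
    my ($gram, $lineno) = split /\t/, $pair;
    if (defined $prev && $gram ne $prev) {
        print {$collapsed} "$prev: @{[ sort { $a <=> $b } @nums ]}\n";
        @nums = ();
    }
    $prev = $gram;
    push @nums, $lineno;
}
print {$collapsed} "$prev: @{[ sort { $a <=> $b } @nums ]}\n" if defined $prev;
close $sorted;
close $collapsed;

# Step 4 (Lookup): Search::Dict's look() binary-searches the sorted file
# and leaves the filehandle at the first line ge the key.
open my $dict, '<', 'collapsed.txt' or die $!;
look($dict, 'the quick');
print scalar <$dict>;    # prints "the quick brown: 1 2"
```

Note that step 1 never holds more than one line of the original file in memory, and step 3 never holds more than one n-gram's worth of line numbers, so the only step that has to think about 1 GB at once is sort, which is built for exactly that.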