Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

Re: statistics of a large text

by tilly (Archbishop)
on Jan 26, 2011 at 14:48 UTC ( #884360=note: print w/ replies, xml ) Need Help??


in reply to statistics of a large text

The standard way to do this type of text analysis on large bodies of text is to use MapReduce. The workflow for that looks like this:

  1. Take text, emit key/value pairs.
  2. Sort the key/value pairs by key then value.
  3. For each key, organize process the sorted values.
With a typical framework like Hadoop you only have to write the first and third steps, which are called Map and Reduce respectively. All three steps can also be distributed across multiple machines, allowing you to scale the work across a cluster.

In your example you can benefit from the same approach, even using just one machine, even without a framework.

Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory. (Particularly not with how wasteful Perl is.) So don't even try, you need to plan on using the disk. And map-reduce uses disk in a way that is very friendly to how disks like to be used. (Stream data to and from, don't seek.)

What you should do is read your original file, and print out all of your n-grams to a second file. It will have lines of the form $n_gram: $line_number Then call the unix sort utility on the second file to get a third file that will have the exact same lines, only sorted by $n_gram, then line number. (Line numbers will be sorted asciibetically, not numerically.) Now take one pass through the third file to collapse to a file with lines of the form $n_gram: @line_numbers. (This file will be trivially sorted. If you care, you can sort your line numbers correctly before printing this file.) And now you can use the built-in module Search::Dict to quickly look up any n-gram of interest in that file. (But if you have to do any significant further processing of this data, I would recommend trying to think about that processing using a similar MapReduce idea. Doing lots of lookups means that you'll be seeking to disk a lot, and disk seeks are slow.)


Comment on Re: statistics of a large text
Select or Download Code
Re^2: statistics of a large text
by perl_lover_always (Acolyte) on Jan 26, 2011 at 15:25 UTC
    Thanks but I think what I do in the above code is same but without sorting ;) I guess your solution is even more time consuming since I can do it inline with my codes by just sorting the keys and values!
      Your guess is wrong.

      You asked for advice on handling large amounts of data (~ 1 GB). With that much data your code will fail to run because it will run out of memory long before you finish. By contrast the approach that I describe should succeed in a matter of minutes.

      If you wish to persist in your approach you can tie hash to an on disk data structure, for instance using DBM::Deep. Do not be surprised if your code now takes a month or two to run on your dataset. (A billion seeks to disk takes about 2 months. And you're going to wind up with, order of magnitude, about that many seeks.) This is substantially longer than my approach.

      If my suggestion fails to perform well enough, it is fairly easy to use Hadoop to scale your processing across a cluster. (Clusters are easy to set up using EC 2.) This approach scales as far as you want - in fact it is the technique that Google uses to process copies of the entire web.

        yes you are right after testing I can notice that it goes out of memory!
        Thanks! in the third step of your approach, how can I merge the $line_number to @line_number in a fast way knowing that now my file is even bigger than before? any advise on that?
Re^2: statistics of a large text
by perl_lover_always (Acolyte) on Jan 31, 2011 at 14:51 UTC
    How can I use search::dict while my line is like this:
    "$key\t@numbers" ?
    example:
    the\t1 5 8 34 56 the one\t1 5 56 the one of\t1 56
    I want to search for "the one of" and retrieve "1 56" or an array including 1 and 56. Thanks alot in advance.
      Assuming that you've done a use Search::Dict; and $fh is an open filehandle to your file then (untested code written by drunk person alert):
      sub match { my $string = quotemeta(shift); Search::Dict::look($fh, $string); my $next_line = <$fh>; if ($next_line =~ /$string: (\d+\s*)+/) { return split /\s+/, $1; } }
Re^2: statistics of a large text
by perl_lover_always (Acolyte) on Feb 08, 2011 at 11:05 UTC
    Hi Is there anyway to skip the sort part! linux sort (with -d and -f) options takes a long time to run! how could I scape that?
      No, there isn't. Sorting does the actual work, and therefore takes the bulk of the time. If it is too slow, then it is time for you to look into distributing (and parallelizing) the work with Hadoop.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://884360]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (14)
As of 2014-07-25 13:02 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    My favorite superfluous repetitious redundant duplicative phrase is:









    Results (171 votes), past polls