statistics of a large text

by perl_lover_always (Acolyte)
on Jan 26, 2011 at 13:53 UTC
perl_lover_always has asked for the wisdom of the Perl Monks concerning the following question:

Hi dear monks, I have written some code and it works. It records which word occurred on which lines. I have a very large text (more than one gigabyte) and I want to make my program efficient enough to build the hash and store it for later use. Please tell me how I can make it more efficient and faster at dealing with large data. Also, is it possible to store it more efficiently, in a less space-consuming form that can be retrieved faster? Thanks.
use strict;
use warnings;
use Encode;
use Storable;
use Lingua::StopWords qw( getStopWords );

my $stopwords = getStopWords('en');

my $file_in  = shift;
my $file_out = shift;

open( FILEIN, '<:utf8', $file_in ) or die "Cannot open $file_in: $!";
#open( OUTPUT, '>:utf8', $file_out );

my $line_count = 0;
my %hash;
for my $line (<FILEIN>) {    # note: this reads the whole file into a list first
    $line_count++;
    my @ngrams = produce_ngrams($line);
    foreach my $ngram (@ngrams) {
        push @{ $hash{$ngram} }, $line_count;
    }
}

store \%hash, $file_out;

my %stat = %{ retrieve($file_out) };
#foreach my $n (keys %stat) { print "$n: @{$stat{$n}}\n"; }

sub produce_ngrams {
    my $string = shift;
    $string =~ s/[[:punct:]]//g;
    $string = lc $string;
    my @unigrams = split ' ', $string;
    #@unigrams = join ' ', grep { !$stopwords->{$_} } @unigrams;
    my @bigrams  = ();
    my @trigrams = ();
    my @qgrams   = ();
    my @cgrams   = ();
    my @ngrams   = ();
    for my $i ( 0 .. $#unigrams - 1 ) {
        my $bigram = join " ", $unigrams[$i], $unigrams[$i+1];
        push @bigrams, $bigram;
    }
    if ( $#unigrams >= 2 ) {
        for my $i ( 0 .. $#unigrams - 2 ) {
            my $trigram = join " ", $unigrams[$i], $unigrams[$i+1], $unigrams[$i+2];
            push @trigrams, $trigram;
        }
    }
    if ( $#unigrams >= 3 ) {
        for my $i ( 0 .. $#unigrams - 3 ) {
            my $qgram = join " ", $unigrams[$i], $unigrams[$i+1], $unigrams[$i+2], $unigrams[$i+3];
            push @qgrams, $qgram;
        }
    }
    if ( $#unigrams >= 4 ) {
        for my $i ( 0 .. $#unigrams - 4 ) {
            my $cgram = join " ", $unigrams[$i], $unigrams[$i+1], $unigrams[$i+2], $unigrams[$i+3], $unigrams[$i+4];
            push @cgrams, $cgram;
        }
    }
    push @ngrams, @unigrams, @bigrams, @trigrams, @qgrams, @cgrams;
    return @ngrams;
}

Re: statistics of a large text
by toolic (Chancellor) on Jan 26, 2011 at 14:09 UTC
    This in no way addresses your question about optimization, but you could take advantage of Slices to reduce some of your code. For example, replace:
    my $cgram = join " ", $unigrams[$i], $unigrams[$i+1], $unigrams[$i+2], +$unigrams[$i+3],$unigrams[$i+4];
    with:
    my $cgram = "@unigrams[ $i .. $i + 4]";
    The double quotes around the array slice interpolate it, joining the elements with a single space (the default value of $").
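    As an illustration only (untested, and reusing @unigrams and @ngrams from the original produce_ngrams), the five near-identical loops could collapse into a single loop over the n-gram lengths:
    my @ngrams = @unigrams;              # unigrams are the length-1 case
    for my $n (2 .. 5) {                 # bigrams up to 5-grams
        for my $i (0 .. $#unigrams - $n + 1) {
            push @ngrams, "@unigrams[ $i .. $i + $n - 1 ]";
        }
    }
    The inner range is simply empty when a line is too short, so the explicit $#unigrams checks are no longer needed.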
Re: statistics of a large text
by eff_i_g (Curate) on Jan 26, 2011 at 14:42 UTC
Re: statistics of a large text
by tilly (Archbishop) on Jan 26, 2011 at 14:48 UTC
    The standard way to do this type of text analysis on large bodies of text is to use MapReduce. The workflow for that looks like this:
    1. Take text, emit key/value pairs.
    2. Sort the key/value pairs by key then value.
    3. For each key, process the sorted values.
    With a typical framework like Hadoop you only have to write the first and third steps, which are called Map and Reduce respectively. All three steps can also be distributed across multiple machines, allowing you to scale the work across a cluster.
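    As a rough sketch of what the first (map) step could look like for this problem, reusing the produce_ngrams sub from the original post and emitting tab-separated pairs (a tab rather than ": ", since the n-grams themselves contain spaces), only one line is ever held in memory:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Map step (sketch): emit one "ngram<TAB>line_number" pair per n-gram.
    my $line_no = 0;
    while ( my $line = <> ) {
        $line_no++;
        for my $ngram ( produce_ngrams($line) ) {   # sub from the original post
            print "$ngram\t$line_no\n";
        }
    }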

    In your example you can benefit from the same approach, even on just one machine and without a framework.

    Your fundamental problem is that you have 1 GB of text to handle. You're not going to succeed in keeping it all in memory. (Particularly not with how wasteful Perl is.) So don't even try; plan on using the disk. And map-reduce uses the disk in a way that is very friendly to how disks like to be used. (Stream data to and from them, don't seek.)

    What you should do is read your original file and print all of your n-grams to a second file, with lines of the form "$n_gram: $line_number". Then call the Unix sort utility on that second file to get a third file with the exact same lines, only sorted by $n_gram, then line number. (Line numbers will be sorted asciibetically, not numerically.) Now take one pass through the third file to collapse it into a file with lines of the form "$n_gram: @line_numbers". (This file will be trivially sorted. If you care, you can sort the line numbers correctly before printing it.) And now you can use the built-in module Search::Dict to quickly look up any n-gram of interest in that file. (But if you have to do any significant further processing of this data, I would recommend thinking about that processing with a similar MapReduce idea. Doing lots of lookups means that you'll be seeking to disk a lot, and disk seeks are slow.)
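    A sketch of the remaining steps, assuming the mapper output above went to a made-up file ngrams.txt and was sorted with something like sort ngrams.txt > ngrams.sorted; one pass then collapses adjacent records:
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Reduce step (sketch): read the sorted "ngram\tline" pairs and print
    # one "ngram\tlist of line numbers" record per n-gram.
    my ( $current, @lines );
    while (<>) {
        chomp;
        my ( $ngram, $line_no ) = split /\t/;
        if ( defined $current && $ngram ne $current ) {
            print $current, "\t", join( ' ', sort { $a <=> $b } @lines ), "\n";
            @lines = ();
        }
        $current = $ngram;
        push @lines, $line_no;
    }
    print $current, "\t", join( ' ', sort { $a <=> $b } @lines ), "\n"
        if defined $current;
    The sort inside the loop only touches the handful of line numbers belonging to one n-gram, so it stays cheap; the heavy lifting is done by the external sort.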

      Thanks, but I think what I do in the code above is the same, just without sorting ;) I guess your solution is even more time consuming, since I can do it inline in my code by just sorting the keys and values!
        Your guess is wrong.

        You asked for advice on handling large amounts of data (~ 1 GB). With that much data your code will fail to run because it will run out of memory long before you finish. By contrast the approach that I describe should succeed in a matter of minutes.

        If you wish to persist in your approach, you can tie the hash to an on-disk data structure, for instance using DBM::Deep. Do not be surprised if your code then takes a month or two to run on your dataset. (A billion seeks to disk take about 2 months, and you're going to wind up with, order of magnitude, about that many seeks.) This is substantially longer than my approach.
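        For reference, a minimal sketch of that tied variant (DBM::Deep is on CPAN; the file name here is made up). It swaps the in-memory %hash for an on-disk one, at the cost of the seeks described above:
        use strict;
        use warnings;
        use DBM::Deep;

        # Keep the n-gram index in a file instead of in RAM.
        my $db = DBM::Deep->new( 'ngrams.db' );

        # ...then, inside the existing read loop from the original post:
        #     push @{ $db->{$ngram} }, $line_count;
        # Each push now becomes disk activity, which is where the slowdown comes from.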

        If my suggestion fails to perform well enough, it is fairly easy to use Hadoop to scale your processing across a cluster. (Clusters are easy to set up using EC2.) This approach scales as far as you want - in fact it is the technique that Google uses to process copies of the entire web.

      How can I use Search::Dict when my lines look like this:
      "$key\t@numbers"?
      Example:
      the\t1 5 8 34 56
      the one\t1 5 56
      the one of\t1 56
      I want to search for "the one of" and retrieve "1 56", or an array containing 1 and 56. Thanks a lot in advance.
        Assuming that you've done a use Search::Dict; and $fh is an open filehandle to your file then (untested code written by drunk person alert):
        sub match {
            my $string = shift;
            Search::Dict::look( $fh, $string );              # binary search to the key
            my $next_line = <$fh>;
            if ( defined $next_line && $next_line =~ /^\Q$string\E\t(.*)/ ) {
                return split ' ', $1;                        # the "@numbers" part
            }
            return;
        }
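        With the sample data from the question above, it might be called like this:
        my @line_numbers = match('the one of');   # hopefully (1, 56)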
      Hi, is there any way to skip the sort part? The Linux sort (with the -d and -f options) takes a long time to run! How could I escape that?
        No, there isn't. Sorting does the actual work, and therefore takes the bulk of the time. If it is too slow, then it is time for you to look into distributing (and parallelizing) the work with Hadoop.
Re: statistics of a large text
by JavaFan (Canon) on Jan 26, 2011 at 16:04 UTC
    Please tell me how I can make it more efficient and faster at dealing with large data. Also, is it possible to store it more efficiently, in a less space-consuming form that can be retrieved faster?
    Be careful about what you want to optimize. You're asking how to optimize the storing of the data, but that's usually the wrong question to ask - it's far more important to optimize your queries. After all, the fastest way to store your input file is to just copy it.

    Determine what kinds of queries you need to run, and use that to decide what kind of data structure you need.
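    For example (a sketch only, not something JavaFan posted, with a made-up file name): if the only query is "which lines contain this exact n-gram", a single on-disk hash keyed by the n-gram answers it in one lookup, and nothing fancier is needed:
    use strict;
    use warnings;
    use Fcntl;      # O_RDWR, O_CREAT
    use DB_File;    # Berkeley DB tie

    tie my %index, 'DB_File', 'ngram_index.db', O_RDWR | O_CREAT, 0644, $DB_BTREE
        or die "Cannot open ngram_index.db: $!";

    # Build phase: append each line number to the n-gram's entry, space separated.
    #     $index{$ngram} = defined $index{$ngram}
    #                    ? "$index{$ngram} $line_count"
    #                    : $line_count;

    # Query phase: a single lookup, no scan of the whole index.
    my @lines = split ' ', ( $index{'the one of'} || '' );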

Re: statistics of a large text
by BrowserUk (Pope) on Jan 26, 2011 at 16:32 UTC

    Does your current code work for your 1GB input file? How long does it take?

      Unfortunately, it runs out of memory!
Re: statistics of a large text
by sundialsvc4 (Monsignor) on Jan 26, 2011 at 20:13 UTC

    Always remember this:   virtual memory is “a disk file.”   Real RAM acts as an excellent and intelligent “buffer,” but it only works because of that peculiar property known as locality of reference (http://en.wikipedia.org/wiki/Locality_of_reference).

    The appropriate method for you to use here is ... writing to files, sorting those files, and comparing the sorted streams.   It worked beautifully in COBOL (and even for punched-cards before computers were invented), and it produces predictable performance for arbitrary quantities of data.

    When you said, in one sentence, “gigabytes of” and “memory,” I stopped reading ... as did everyone else.   It was not necessary to know the details.

Re: statistics of a large text
by planetscape (Canon) on Jan 26, 2011 at 23:04 UTC
      I need to keep track of the line numbers!
