PerlMonks  

tying a hash from a big dictionary

by Anonymous Monk
on Oct 31, 2011 at 13:21 UTC ( [id://934875]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, I have a very big dictionary of phrases (millions of entries), and when I read it into a hash I run out of memory. I want to store it once in local memory and then use it efficiently as my dictionary every time. What modules do you recommend for: 1- tying it in local memory, and 2- accessing it efficiently? I use this code to read my dictionary:

    sub read_dict{
        my $file=shift;
        my %dict;
        open( FILE, "<:encoding(utf8)", $file );
        while (<FILE>) {
            chomp;
            my ($p1, $p2) = split /\t/;
            chomp ($p1, $p2);
            push( @{$dict{$p1}}, $p2 );
        }
        close FILE;
        return %dict;
    }

Note that I run out of memory while reading it, so I need to be able to read and store each entry instead of reading the whole dictionary. Later I use this hash to look up many, many entries in my text. Thanks in advance for your hints and code snippets.

Replies are listed 'Best First'.
Re: tying a hash from a big dictionary
by BrowserUk (Patriarch) on Oct 31, 2011 at 13:39 UTC
    I use this code to read my dictionary:

    You are using far more memory than necessary (double, maybe even triple, the requirement) because of the way you are returning the data from your subroutine.

    It may not be enough to relieve your out-of-memory situation, but try this before you seek other more complex and inevitably slower solutions:

        sub read_dict {
            my $file = shift;
            my %dict;
            open( my $fh, "<:encoding(utf8)", $file )
                or die "Cannot open $file: $!";
            while( <$fh> ) {
                chomp;    ## no need to chomp twice
                my ($p1, $p2) = split /\t/;
                push( @{ $dict{ $p1 } }, $p2 );
            }
            close $fh;
            return \%dict;    ## main space saving change; return a ref to the hash
        }
        ...
        my $dict = read_dict( $dict_name );
        ...
        for my $next_phrase ( @{ $dict->{ $key } } ){
            ...
        }
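
To illustrate why the return style matters, here is a minimal sketch. The helper names `dict_as_list` and `dict_as_ref` are hypothetical and exist only for this illustration. Returning a hash in list context flattens it into a list of keys and values, and the caller then builds a second hash from that list. Returning a reference hands back one scalar that points at the hash already built.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only.
sub dict_as_list {
    my %dict = ( foo => ['a'], bar => ['b'], baz => ['c'] );
    return %dict;     # flattened to a key/value list in list context
}

sub dict_as_ref {
    my %dict = ( foo => ['a'], bar => ['b'], baz => ['c'] );
    return \%dict;    # a single scalar pointing at the existing hash
}

my @flat = dict_as_list();   # 6 elements: 3 keys + 3 values
my @one  = dict_as_ref();    # 1 element: the hash reference

my %copy = dict_as_list();   # the caller builds a second hash from the list
my $dict = dict_as_ref();    # no copy; $dict->{foo} reads the original data

printf "list return: %d scalars, ref return: %d scalar\n",
    scalar @flat, scalar @one;
```

With millions of entries, the flattened list and the second hash both exist alongside the original until the subroutine's hash is freed, which is where the extra memory goes.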

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      That was a nice one, thanks! I still have a memory problem, but this tip saved me a lot as well!

        How many lines has your file? How many of those are you succeeding in loading before you run out of memory?


        On a 4 GB machine, it runs out of memory after 5 million dictionary lines.
Re: tying a hash from a big dictionary
by johngg (Canon) on Oct 31, 2011 at 13:42 UTC

    If your huge dictionary will not fit in memory then perhaps you should look at a disk-based DBM, perhaps Berkeley DB.

    Cheers,

    JohnGG
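
The disk-based DBM suggestion can be sketched as follows. This is an illustration under stated assumptions, not johngg's code: it uses the core `SDBM_File` module instead of Berkeley DB so that it runs without extra libraries, the `@lines` array stands in for lines read from the dictionary file, and joining multiple phrases with a newline is one possible way to keep several values per key, since DBM values are plain strings.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;                      # core DBM, swapped in here for portability
use File::Temp qw(tempdir);

# Assumption: a temporary directory holds the DBM files for this demo.
my $dir = tempdir( CLEANUP => 1 );

my %dict;
tie %dict, 'SDBM_File', "$dir/dict", O_RDWR | O_CREAT, 0666
    or die "Cannot tie SDBM file: $!";

# Stand-in for lines read one at a time from the dictionary file.
my @lines = ( "foo\tbar", "foo\tbaz", "aho\t234" );

for my $line (@lines) {
    my ($p1, $p2) = split /\t/, $line, 2;
    # DBM values are strings, so extra phrases are appended after a newline.
    $dict{$p1} = exists $dict{$p1} ? "$dict{$p1}\n$p2" : $p2;
}

# Lookups go to disk; only the requested entry is brought into memory.
my @phrases = split /\n/, $dict{foo};
print "foo => @phrases\n";

untie %dict;
```

SDBM limits the size of each key/value pair to roughly 1 KB, so a real dictionary with long value lists would more likely use `DB_File` or a similar module. Non-ASCII keys and values would probably need to be encoded to bytes before being stored.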

Re: tying a hash from a big dictionary
by repellent (Priest) on Nov 01, 2011 at 09:23 UTC
    You can try accessing the dictionary file directly using the Search::Dict core module, assuming your dictionary is sorted. It performs a binary search through the file. Here, I've wrapped its functionality into an OO-module for convenience:
        use Data::Dumper;
        use Search::Dict::Object;

        my $d = Search::Dict::Object->new(
            file        => "/tmp/dict.txt",
            keyval_xfrm => sub { split /\t/ },
            comp        => sub { $_[0] cmp $_[1] },  # should correspond to file sort order
        );

        print Dumper {
            aaa => $d->get('aaa'),
            foo => $d->get('foo'),
            bar => $d->get('bar'),
            baz => $d->get('baz'),
            zzz => $d->get('zzz'),
        };

        __END__
        $VAR1 = {
                  'bar' => '789',
                  'baz' => '456',
                  'aaa' => undef,
                  'foo' => '123',
                  'zzz' => undef
                };

    The dictionary file:

        $ cat /tmp/dict.txt
        aho 234
        bar 789
        bat 567
        baz 456
        cut 678
        foo 123
        yyy 000

    The Search::Dict::Object package:
      I've wrapped its functionality into an OO-module for convenience:

      "I've wrapped your bicycle in tissue paper and a nice bow." -- but it sure ain't for "convenience" :)


        I find it convenient to have a single transform sub that produces the key-value pair for the object to search+parse a hash-like dict file. Handling/closing of filehandle is really just cake icing.

        Search::Dict sets the filehandle position to the first line greater than or equal to $key. This seems pretty raw to me (read: that I should probably write some wrapper that takes care of the edge cases). The OO stick is not always the first thing I reach for, in case you're wondering.
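
For readers without the wrapper package, the raw behaviour described above can be sketched with the core `Search::Dict` module directly. This is an illustration only, not the `Search::Dict::Object` code: the `lookup` helper is hypothetical, the data is written to a temporary file, and the file is assumed to be ASCII, tab-separated, and sorted byte-wise. Because `look` only positions the filehandle, the helper still has to check that the line it lands on really carries the requested key.

```perl
use strict;
use warnings;
use Search::Dict;                 # core module; exports look()
use File::Temp qw(tempfile);

# Build a small sorted, tab-separated dictionary for the demo.
my ($out, $path) = tempfile( UNLINK => 1 );
print {$out} "aho\t234\n", "bar\t789\n", "bat\t567\n", "baz\t456\n",
             "cut\t678\n", "foo\t123\n", "yyy\t000\n";
close $out or die "Cannot close $path: $!";

open my $fh, '<', $path or die "Cannot open $path: $!";

# Hypothetical helper: return every value stored under $key (possibly none).
sub lookup {
    my ($fh, $key) = @_;

    # look() binary-searches the file and leaves the handle at the first
    # line that is greater than or equal to $key; it returns -1 on error.
    return if look( $fh, $key, 0, 0 ) < 0;

    my @values;
    while ( defined( my $line = <$fh> ) ) {
        chomp $line;
        my ($k, $v) = split /\t/, $line, 2;
        last unless defined $k && $k eq $key;   # past the key, or key absent
        push @values, $v;
    }
    return @values;
}

for my $key (qw(aaa foo bar zzz)) {
    my @found = lookup( $fh, $key );
    print "$key => ", ( @found ? "@found" : "(not found)" ), "\n";
}
```

The two edge cases are a key that sorts between existing entries (the handle lands on a different key) and a key past the end of the file (the next read returns nothing). Both fall out of the same `eq` check in the loop.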
Re: tying a hash from a big dictionary
by tokpela (Chaplain) on Nov 01, 2011 at 18:14 UTC
