
Re^7: statistics of a large text

by BrowserUk (Pope)
on Feb 10, 2011 at 14:13 UTC (#887455)


in reply to Re^6: statistics of a large text
in thread statistics of a large text

Why, when I store my 5 GB file, which has about 7M records in two columns, and I make two hashes from two different files of the same format and size, do I run out of memory even with a large amount of RAM (50 GB)?

Assuming that your OS and Perl allow you access to the full 50 GB, you should not be running out of memory.

On a 64-bit system, a HoA (hash of arrays) with 7 million keys and an average of 10 numbers per array requires ~3.5 GB. For two, reckon on 10 GB max.
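One way to check a figure like this on your own machine is to measure a small sample and scale it up. The following is only a rough sketch: it assumes the CPAN module Devel::Size is installed, and the sub name estimate_hoa_bytes, the sample size, and the made-up keys and values are all illustrative. Real n-gram keys and values will have different sizes, so treat the result as a ballpark.

```perl
use strict;
use warnings;

# Build a small hash of arrays, measure it, and scale the result to $total_keys.
# Returns an estimate in bytes, or nothing if Devel::Size is not installed.
# (Hypothetical helper, not part of the original script.)
sub estimate_hoa_bytes {
    my ( $total_keys, $per_key, $sample_keys ) = @_;
    $sample_keys ||= 10_000;
    return unless eval { require Devel::Size; 1 };

    my %sample;
    for my $k ( 1 .. $sample_keys ) {
        # Illustrative keys and values; real data will differ in size.
        $sample{"key$k"} = [ map { int rand 1e6 } 1 .. $per_key ];
    }

    # total_size follows references, so the arrays are included.
    my $bytes = Devel::Size::total_size( \%sample );
    return $bytes / $sample_keys * $total_keys;
}

my $est = estimate_hoa_bytes( 7_000_000, 10 );
printf "Estimated size: %.2f GB\n", $est / 1024**3 if defined $est;
```

Scaling linearly ignores the way a hash grows its bucket array in steps, so the estimate is approximate at best.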

I'm not aware of any restrictions or limits on the memory a 64-bit Perl can address, which leaves your OS. Linux can apply per-process (and per-user?) limits to memory and CPU usage. I don't know what the commands are for discovering this information, but maybe that is somewhere you should be looking.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Replies are listed 'Best First'.
Re^8: statistics of a large text
by perl_lover_always (Acolyte) on Feb 10, 2011 at 14:21 UTC
    I have no idea, since I can access the whole memory (all 50 GB). Do you think it has something to do with my code?
      I have no idea since I can access the whole memory (all 50 GB)

      How do you know you can access the whole of memory?

      What happens if you run this code?:

      perl -e' $x = chr(0) x ( 1024**3 * 12 ) '

      That will attempt to allocate a single 12GB lump of memory. If it fails, then try adjusting the 12 to a lower value to discover how much memory Perl can actually allocate.
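      The trial-and-error search described above can be scripted. This is a minimal sketch, not a definitive tool: the sub names can_allocate and largest_allocatable_gb are made up for illustration. Each attempt runs in a child perl, because running out of memory usually kills the process rather than raising an exception that eval can catch.

```perl
use strict;
use warnings;

# True if a child perl managed to build a string of $bytes bytes.
# (Hypothetical helper name.)
sub can_allocate {
    my ($bytes) = @_;
    # $^X is the path of the running perl; list-form system avoids the shell.
    my $status = system( $^X, '-e', '$x = chr(0) x $ARGV[0]', $bytes );
    return $status == 0;
}

# Largest whole number of GB in 1..$max_gb that could be allocated, or 0.
sub largest_allocatable_gb {
    my ($max_gb) = @_;
    my $best = 0;
    for my $gb ( 1 .. $max_gb ) {
        last unless can_allocate( $gb * 1024**3 );
        $best = $gb;
    }
    return $best;
}

# Example use (allocates real memory, so it is left commented out):
# print "Largest single block: ", largest_allocatable_gb(12), " GB\n";
```

      This only tests one contiguous string, as the one-liner does; a hash of arrays is made of many small allocations and carries extra per-element overhead.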


        it does not fail!
      Do you think it has something to do with my code?

      It is beginning to look that way. Can you post the latest version of your code?


        The code is tested with a toy example!
        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::LibXML;
        use List::Util qw(sum);
        use diagnostics;

        my $RTE = shift;
        my $file_in_es=shift;
        my $file_in_en=shift;
        my $out =shift;
        my $no;
        my %lemma_hypo=();
        my %stem_hypo=();
        my %token_hypo=();

        #reading RTE corpus
        my $parser = XML::LibXML->new();
        my $doc = $parser->parse_file( $RTE );

        #create hash files from ngram statistics
        my %hash_en=to_hash($file_in_en);
        my %hash_es=to_hash($file_in_es);

        #open to write the results!
        open (OUTPUT, ">$out");

        for my $n (1..800){
            my $entailment = $doc->find( '//pair[@id = '.$n.']/@entailment' );

            # READ LEMMA, TOKENS and STEMS of TEXT.
            my $lemma_text = $doc->find( '//pair[@id = '.$n.']/tAnnotation/word/attribute[@name="lemma"]' );
            my @lemma_text = &to_Array($lemma_text);
            @lemma_text = remove_punc(@lemma_text);

            my $token_text = $doc->find( '//pair[@id = '.$n.']/tAnnotation/word/attribute[@name="token"]' );
            my @token_text = &to_Array($token_text);
            @token_text = remove_punc(@token_text);

            my $stem_text = $doc->find( '//pair[@id = '.$n.']/tAnnotation/word/attribute[@name="stem"]' );
            my @stem_text = &to_Array($stem_text);
            @stem_text = remove_punc(@stem_text);

            my $hypo = $doc->find( '//pair[@id = '.$n.']/h' );
            my @hypo = to_Array($hypo);
            my @MI = ();
            my $MI = 0;

            #FOR EACH HYPO
            for my $x (0..$#hypo) {
                my @MI_x = ();
                # READ LEMMA, TOKENS and STEMS of EACH HYPOTHESIS.
                my $no=$x+1;
                my $lemma_hypo = $doc->find( '//pair[@id = '.$n.']/hAnnotation[@no = '.$no.']/word/attribute[@name="lemma"]' );
                @{$lemma_hypo{$x}} = &to_Array($lemma_hypo);
                @{$lemma_hypo{$x}} = remove_punc(@{$lemma_hypo{$x}});

                my $token_hypo = $doc->find( '//pair[@id = '.$n.']/hAnnotation[@no = '.$no.']/word/attribute[@name="token"]' );
                @{$token_hypo{$x}} = &to_Array($token_hypo);
                @{$token_hypo{$x}} = remove_punc(@{$token_hypo{$x}});

                my $stem_hypo = $doc->find( '//pair[@id = '.$n.']/hAnnotation[@no = '.$no.']/word/attribute[@name="stem"]' );
                @{$stem_hypo{$x}} = &to_Array($stem_hypo);
                @{$stem_hypo{$x}} = remove_punc(@{$stem_hypo{$x}});

                for my $i (0..$#{$token_hypo{$x}}) {
                    my $current_token_hypo = lc($token_hypo{$x}[$i]);
                    my $current_stem_hypo = lc($stem_hypo{$x}[$i]);
                    my $current_lemma_hypo = lc($lemma_hypo{$x}[$i]);
                    $MI_x[$i]=0;
                    my $MI_token_hypo= my $MI_T = 0;

                    if (exists $hash_es{$current_token_hypo}) {
                        foreach $token_text (@token_text) {
                            $token_text=lc($token_text);
                            $MI_T=0;
                            if (exists $hash_en{$token_text}) {
                                $MI_T=MI($current_token_hypo,$token_text,\%hash_es,\%hash_en);
                            }
                            $MI_token_hypo = $MI_token_hypo + $MI_T;
                        }
                        $MI_x[$i]=$MI_token_hypo/$#token_text;
                    }
                    elsif (exists $hash_es{$current_lemma_hypo}) {
                        foreach $token_text (@token_text) {
                            $MI_T=0;
                            if (exists $hash_en{$token_text}) {
                                $MI_T=MI($current_lemma_hypo,$token_text,\%hash_es,\%hash_en);
                            }
                            $MI_token_hypo = $MI_token_hypo + $MI_T;
                        }
                        $MI_x[$i]=$MI_token_hypo/$#token_text;
                    }
                    elsif (exists $hash_es{$current_stem_hypo}) {
                        foreach $token_text (@token_text) {
                            $MI_T=0;
                            if (exists $hash_en{$token_text}) {
                                $MI_T=MI($current_stem_hypo,$token_text,\%hash_es,\%hash_en);
                            }
                            $MI_token_hypo = $MI_token_hypo + $MI_T;
                        }
                        $MI_x[$i]=$MI_token_hypo/$#token_text;
                    }
                }
                push @MI,mean(@MI_x);
                if ($x==0) {$MI=$MI_x[0];}
                elsif ($MI[$x] >= $MI[$x-1]) {$MI=$MI[$x];}
            }
            #$MI = sprintf("%.15f", $MI);
            $MI = $MI*1000000;
            $MI = sprintf("%.4f", $MI);
            print OUTPUT "$n\t$entailment\t$MI\n";
        }
        close OUTPUT;

        #====================================================================================================
        #   ***************** ALL FUNCTIONS AND SUBROUTINES ARE HERE **********************************    =
        #====================================================================================================
        sub mean {
            return sum(@_)/@_;
        }

        sub to_hash {
            my %hash;
            my $file = shift;
            open(FILE, "<$file");
            foreach my $l (<FILE>) {
                my ($ngram,$line) = split /\t/, $l;
                push(@{ $hash{$ngram} }, $line);
            }
            close FILE;
            return %hash;
        }

        sub MI {
            my ($string_es,$string_en,$hash_es,$hash_en)=@_;
            my @array_es= my @array_en = my @intersection = ();
            @array_es = @{$hash_es{$string_es}};
            @array_en = @{$hash_en{$string_en}};
            my $prob_es = ($#array_es+1)/6939873;
            my $prob_en = ($#array_en+1)/6939873;
            @intersection= Intersection(@array_es,@array_en);
            my $prob_es_en= ($#intersection+1)/6939873;
            $prob_es_en = ($prob_es_en + ($prob_es*$prob_en*0.1))/1.1;
            my $mi= $prob_es_en* log($prob_es_en/($prob_es*$prob_en));
            return $mi;
        }

        sub Intersection {
            my (@array1,@array2)=@_;
            my @union = my @intersection = my @difference = ();
            my %count = ();
            foreach my $element (@array1, @array2) { $count{$element}++ }
            foreach my $element (keys %count) {
                push @union, $element;
                push @{ $count{$element} > 1 ? \@intersection : \@difference }, $element;
            }
            return @intersection;
        }

        sub to_Array {
            my $string = shift;
            my @array;
            if (my @arraynodes = $string->get_nodelist) {
                @array = map($_->string_value, @arraynodes);
            }
            return @array;
        }

        sub remove_punc {
            my @array = @_;
            my @filtered;
            for my $i (0..$#array){
                unless ($array[$i] =~ m/[[:punct:]]/ ){
                    push @filtered,$array[$i];
                }
            }
            return @filtered;
        }
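      Two things in to_hash add to peak memory, whether or not they are what exhausts it here. `foreach my $l (<FILE>)` evaluates `<FILE>` in list context, so the whole file is read into a temporary list before the loop starts. `return %hash` then hands back a flattened list that is copied into a second hash in the caller. A possible variant is sketched below; the name to_hash_ref is made up, and the chomp, the check on open, and the skipping of lines without a tab are additions that the original does not have.

```perl
use strict;
use warnings;

# Sketch of a leaner loader: reads one line at a time and returns a
# reference, so the hash is not copied on return.
# (Hypothetical name; not the poster's code.)
sub to_hash_ref {
    my ($file) = @_;
    my %hash;
    open( my $fh, '<', $file ) or die "Cannot open '$file': $!";
    while ( my $l = <$fh> ) {    # scalar context: one line per iteration
        chomp $l;                # assumption: the trailing newline is not wanted
        my ( $ngram, $line ) = split /\t/, $l;
        next unless defined $line;    # skip lines without a second column
        push @{ $hash{$ngram} }, $line;
    }
    close $fh;
    return \%hash;
}
```

      The caller would then hold a reference, for example `my $hash_en = to_hash_ref($file_in_en);`, and test keys with `exists $hash_en->{$token_text}`.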

      Time to ask your systems administrator to check if there are any limitations in place. Check ulimit -a and man limits.conf.
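      On Linux, the limits that apply to a running process can also be read from /proc/self/limits, which lets the script report them itself. The following is a sketch under that assumption (the file is Linux-specific, and the sub name parse_limits is made up). "Max address space" and "Max data size" are the rows that should correspond to `ulimit -v` and `ulimit -d`.

```perl
use strict;
use warnings;

# Parse text in the layout of Linux's /proc/<pid>/limits into
# { 'Limit name' => { soft => ..., hard => ... } }.
# (Hypothetical helper name.)
sub parse_limits {
    my ($text) = @_;
    my %limits;
    for my $line ( split /\n/, $text ) {
        next if $line =~ /^Limit\s/;    # header row
        # Name (may contain single spaces), then soft and hard values.
        if ( $line =~ /^(.+?)\s{2,}(\S+)\s+(\S+)/ ) {
            $limits{$1} = { soft => $2, hard => $3 };
        }
    }
    return \%limits;
}

my $path = '/proc/self/limits';    # Linux-specific
if ( open my $fh, '<', $path ) {
    local $/;                      # slurp mode: read the whole file at once
    my $text = <$fh>;
    close $fh;
    my $limits = parse_limits( $text // '' );
    for my $name ( 'Max address space', 'Max data size' ) {
        my $l = $limits->{$name} or next;
        print "$name: soft=$l->{soft} hard=$l->{hard}\n";
    }
}
```

      If either row shows a number rather than "unlimited", that number (in bytes) is a cap on the process regardless of how much RAM the machine has.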
