Limit the size of a hash

by Dr Manhattan (Beadle)
on Sep 05, 2013 at 06:51 UTC
Dr Manhattan has asked for the wisdom of the Perl Monks concerning the following question:

Hi all

I am trying to process a very large file and store only some of the data. Each line of the file contains a code and a score. A simple example looks like this:

0 5 3 7 0.92
1 5 4 8 0.12
6 8 4 6 0.54

With the code on the left hand side and its score on the right. I want to keep only the elements with the 10 highest scores. However, the file that I am trying to process is more than 8 GB, so I don't want to put the entire thing into a hash. Is there a way to limit the size of a hash? Or a way to keep only the 10 highest values and discard everything else?

Thanks in advance for any help, much appreciated

Re: Limit the size of a hash (use an array)
by Anonymous Monk on Sep 05, 2013 at 07:01 UTC

    1) use an array (say presized to 12k)
    2) when you've read in 12k elements
    3) sort the array
    4) and discard all but the highest 10 (splice)

    Then put it in a hash or whatever you need to do
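
    A minimal sketch of that batch-and-splice idea, assuming the score is the last whitespace-separated field on each line; the 12k buffer size and reading from STDIN are only placeholders:

    use strict;
    use warnings;

    my $keep  = 10;
    my $batch = 12_000;                       # assumed buffer size
    my @buf;                                  # [ $score, $line ] pairs

    sub trim {
        @buf = sort { $b->[0] <=> $a->[0] } @buf;
        splice @buf, $keep if @buf > $keep;   # discard all but the highest 10
    }

    while ( my $line = <STDIN> ) {
        my $score = ( split ' ', $line )[-1];
        next unless defined $score;
        push @buf, [ $score, $line ];
        trim() if @buf >= $batch;             # sort and splice once per batch
    }
    trim();                                   # the final, partial batch

    print $_->[1] for @buf;

    Sorting only once per batch keeps the per-line cost low at the price of holding up to 12k records in memory at a time.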

Re: Limit the size of a hash
by Corion (Pope) on Sep 05, 2013 at 07:02 UTC

    How about just keeping the first 10 elements, and whenever you read a new element, keeping only the 10 highest-scoring of the resulting 11? That way you never have more than 11 elements in memory.
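
    In code, that streaming idea might look something like this minimal sketch (again assuming the score is the last field on each line, and using STDIN just for illustration):

    use strict;
    use warnings;

    my $keep = 10;
    my @top;                                  # [ $score, $line ] pairs, never more than $keep + 1

    while ( my $line = <STDIN> ) {
        my $score = ( split ' ', $line )[-1];
        next unless defined $score;

        push @top, [ $score, $line ];         # the 11th element, briefly
        if ( @top > $keep ) {
            @top = ( sort { $b->[0] <=> $a->[0] } @top )[ 0 .. $keep - 1 ];
        }
    }

    print $_->[1] for @top;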

Re: Limit the size of a hash
by Random_Walk (Parson) on Sep 05, 2013 at 08:26 UTC
    use strict;
    use warnings;
    use Data::Dumper;

    my $keep = 5;
    my @keeper;

    # populate our store with the first $keep vals
    for (1 .. $keep) {
        my @vals  = split /\s+/, <DATA>;
        my $score = pop @vals;
        push @keeper, [$score, \@vals];
    }

    # Sort it so the smallest is at the front
    @keeper = sort { $a->[0] <=> $b->[0] } @keeper;
    print "$_->[0], " for @keeper;

    while (<DATA>) {
        my @vals = split;
        next unless @vals;          # could do a more sophisticated test
        my $score = pop @vals;
        if ($score > $keeper[0]->[0]) {
            $keeper[0] = [$score, \@vals];
            @keeper = sort { $a->[0] <=> $b->[0] } @keeper;
            print "\nAdded $score: now have: ";
            print "$_->[0], " for @keeper;
        }
    }

    __DATA__
    0 5 3 7 0.97
    1 5 4 8 0.18
    1 5 4 8 0.21
    1 5 4 8 0.22
    1 5 4 8 0.52
    1 5 4 8 0.16
    1 5 4 8 0.18
    1 5 4 8 0.72
    1 5 4 8 0.32
    1 5 4 8 0.17
    1 5 4 8 0.52
    1 5 4 8 0.92
    6 8 4 6 0.54

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: Limit the size of a hash
by space_monk (Chaplain) on Sep 05, 2013 at 08:42 UTC
    You could use a package such as Tie::Array::Sorted to maintain a sorted array of your 10 highest scores (mapped to the corresponding code); once the array increases in size to above 10, just remove the lowest (last) item on the array as you go through the file.
    The following needs a little refinement/debugging, but shows an outline of how I think it should work....
    use Tie::Array::Sorted;

    # keep the array sorted with the highest score first
    tie my @a, "Tie::Array::Sorted",
        sub { $_[1]->{score} <=> $_[0]->{score} };

    # read file line by line
    while (<FH>) {
        # use split if you hate regexes... :-)
        if ( /((\d ){4})(\d\.\d+)/ ) {
            # add new element to array (the score is the third capture)
            push @a, { 'score' => $3, 'code' => $1 };
            # remove lowest scoring/last element if more than 10
            pop @a if @a > 10;
        }
    }
    If you spot any bugs in my solutions, it's because I've deliberately left them in as an exercise for the reader! :-)
Re: Limit the size of a hash
by Tux (Monsignor) on Sep 05, 2013 at 08:49 UTC

    If you need to analyze the data more often than once and speed is not the main issue, but size is, put the hash in a database. Use something like select … from hashtable order by … limit 10;. You can fill the database with e.g. Tie::Hash::DBD and have the best of both worlds.
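
    Skipping the Tie::Hash::DBD layer and using plain DBI, a bare-bones SQLite sketch of the same idea might look like this; the file, table and column names are made up for illustration:

    use strict;
    use warnings;
    use DBI;

    # hypothetical SQLite file and table; adjust to taste
    my $dbh = DBI->connect( "dbi:SQLite:dbname=scores.db", "", "",
                            { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do( "CREATE TABLE IF NOT EXISTS hashtable ( code TEXT, score REAL )" );

    my $ins = $dbh->prepare( "INSERT INTO hashtable ( code, score ) VALUES ( ?, ? )" );
    while ( my $line = <STDIN> ) {
        my @f     = split ' ', $line;
        my $score = pop @f;
        $ins->execute( "@f", $score );
    }
    $dbh->commit;

    # let the database do the work of finding the 10 highest scores
    my $top = $dbh->selectall_arrayref(
        "SELECT code, score FROM hashtable ORDER BY score DESC LIMIT 10" );
    print "$_->[0] $_->[1]\n" for @$top;

    $dbh->disconnect;

    Once the data is loaded you can re-run different "order by ... limit ..." queries without touching the 8 GB file again.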


    Enjoy, Have FUN! H.Merijn
Re: Limit the size of a hash
by BrowserUk (Pope) on Sep 05, 2013 at 09:37 UTC

    There is an anomaly in what you are seeking to do.

    An 8GB file of 12-character lines gives ~715 million lines.

    With the selection value being a number between 0.00 and 0.99, there are only 100 possible values; which means there could be an average of ~7 million lines with each value. Which 10 of the ~7 million 0.99-value lines do you want?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

        Or perhaps the OP actually wants all the records that contain the top 10 values.

        Eg. ~7 million lines with 0.99; ~7 million with 0.98; ...



      That assumes a flat distribution. If it was normally distributed with a sufficiently small standard deviation (0.1 would work, according to the back of my envelope) then talking about the top 10 would make sense.

Re: Limit the size of a hash
by brx (Pilgrim) on Sep 05, 2013 at 09:38 UTC

    Similar to Random_Walk's version (Re: Limit the size of a hash), but without sort.

    Here, we keep five elements.

    If necessary, we calculate the index of the lowest element (min of highest).

    If the score of the new element is better than the "min of highest", the "min of highest" is replaced by the new element.

    use strict;
    use warnings;

    my $keep = 5;

    # init
    my @highest = map scalar <DATA>, 1 .. $keep;    # first lines
    my @scores  = map +(split)[-1], @highest;       # score is last element (-1) of each line
    my $indexofmin = 0;
    my $updated    = 1;

    ####
    while (my $line = <DATA>) {
        my $score = (split / /, $line)[-1];
        if ($updated) {
            for my $i (0 .. $#scores) {
                $indexofmin = $i if $scores[$i] < $scores[$indexofmin];
            }
        }
        if ($score > $scores[$indexofmin]) {
            $highest[$indexofmin] = $line;
            $scores[$indexofmin]  = $score;
            $updated = 1;
        }
        else {
            $updated = 0;
        }
    }
    print @highest;

    __DATA__
    0 5 3 7 0.97
    1 5 4 8 0.18
    1 5 4 8 0.21
    1 5 4 8 0.22
    1 5 4 8 0.52
    1 5 4 8 0.16
    1 5 4 8 0.18
    1 5 4 8 0.72
    1 5 4 8 0.32
    1 5 4 8 0.17
    1 5 4 8 0.52
    1 5 4 8 0.92
    6 8 4 6 0.54
    English is not my mother tongue.
    Les tongues de ma mère sont "made in France".
Re: Limit the size of a hash
by BrowserUk (Pope) on Sep 05, 2013 at 10:40 UTC

    Based on the assumption that you're using a hash because you want all the records containing each of the top 10 values, this uses two passes: one to find the values and select the top 10, and a second to print the records containing those values:

    #! perl -sw
    use strict;

    open FH, '<', $ARGV[ 0 ] or die $!;

    ## First pass; find all the values
    my %vals;
    m[(\S+)$] and ++$vals{ $1 } while <FH>;

    ## Select the top 10 and put them in a hash for fast lookup.
    my %top10;
    undef @top10{ ( sort{ $b <=> $a } keys %vals )[ 0 .. 9 ] };

    ## rewind for second pass
    seek FH, 0, 0;

    ## And print them if they fit the criteria
    m[(\S+)$] and exists $top10{ $1 } and print while <FH>;

    close FH;

Re: Limit the size of a hash
by hdb (Prior) on Sep 05, 2013 at 11:20 UTC

    And here is my version based on insertion sort, complete with random data generation and testing:

    use strict;
    use warnings;

    sub nextrecord {
        return undef unless shift;
        # generate random records
        return {
            code  => join( " ", (int rand 10), (int rand 10), (int rand 10), (int rand 10) ),
            score => rand,
        };
    }

    my $keep    = 10;
    my $counter = 1000;

    # initialize with the first 10 records sorted
    my @largest = sort { $a->{score} <=> $b->{score} } map nextrecord( $counter-- ), 1 .. $keep;
    my @test    = @largest;                       # for testing

    while ( my $next = nextrecord( $counter-- ) ) {
        push @test, $next;                        # keep all for testing
        my $pos = 0;
        # insertion sort
        while ( ($pos < $keep) && ($largest[$pos]->{score} < $next->{score}) ) {
            $largest[$pos-1] = $largest[$pos] if $pos;
            $pos++;
        }
        $largest[$pos-1] = $next if $pos;
    }

    print map { "$_->{code}\t$_->{score}\n" } @largest;

    # test
    print "\n";
    print map { "$_->{code}\t$_->{score}\n" } ( sort { $a->{score} <=> $b->{score} } @test )[-10..-1];

      I just wrote my own with insertion sort. Looks like you beat me to it. This one no longer needs to initialise and sort the store array with real data; it just fills it with the correct number of zero values.

      use strict;
      use warnings;
      use Data::Dumper;

      my $keep = 5;
      my @keeper;
      push @keeper, [0] for 1 .. $keep;

      while (<DATA>) {
          my @vals = split;
          next unless @vals;          # possibly more sophisticated test
          my $score = pop @vals;
          addlist( [$score, \@vals], \@keeper );
      }

      print Dumper \@keeper;

      # slide up the list knocking down existing records
      # until we no longer have the biggest
      sub addlist {
          my ($rec, $list) = @_;
          for my $i (0 .. $#$list) {
              if ($rec->[0] > $list->[$i]->[0]) {
                  $list->[$i-1] = $list->[$i] if $i;
                  $list->[$i]   = $rec;
              }
              else {
                  last;
              }
          }
      }

      __DATA__
      0 5 3 7 0.97
      1 3 4 8 0.18
      1 4 4 8 0.21
      1 7 4 4 0.22
      1 8 4 8 0.52
      1 9 4 1 0.16
      1 5 1 8 0.18
      1 5 2 9 0.72
      1 5 5 8 0.32
      1 5 6 6 0.17
      1 5 7 4 0.52
      1 5 9 3 0.92
      1 5 4 8 0.99
      6 8 4 6 0.54


        I am usually careful with default values. Here you implicitly assume that all scores are positive; otherwise the zeroes could influence the result...

Re: Limit the size of a hash (beware of 'add one and (re)sort and discard' algorithms)
by BrowserUk (Pope) on Sep 05, 2013 at 14:25 UTC

    If it is just a randomly selected 10 items with the highest value you are after, beware of 'add one and (re)sort and discard' algorithms.

    O(N log N) sorting algorithms are pretty efficient for large sets of data, but (relatively) pretty lousy on small sets.

    Eg. log(100e6) = 18.42, which is ~0.000018% of N; but log(10) = 2.3, which is 23% of N.

    So, given your ~715e6 records, the 'add 1, resort and discard 1' algorithms do 715 million * O(log 11) sorts, which is ~= O(745e6).

    Conversely, if you add 1000 items, sort and discard 1000, you get 715 thousand * O(log 1010) sorts, which is ~= O(10e6).

    And (theoretically), if you add 1e6 items, sort, and discard 1e6, then you get 715 * O(log 1000010) sorts, which is ~= O(9878).

    Of course, these CompSci theoretical complexity estimates tend to ignore the cost of memory allocations, deallocations, etc. and thus never work out that way in reality, but they do give an indication of where to look for savings.

    This compares Add 1 and Add many algorithms for various values of many, and adding somewhere between 100 and 1000 items at a time seems to work best for an entirely in-memory, cpu-bound process:

    #! perl -slw
    use strict;
    use List::Util qw[ sum min ];
    use Time::HiRes qw[ time ];

    sub timeit(&) {
        my $soFar = sum( times() );
        my $start = time;
        $_[0]->();
        return sprintf "Wall:%6.3f secs; cpu:%6.3f secs",
            time() - $start, sum( times() ) - $soFar;
    }

    our $T //= 10e6;
    our $N //= 10;
    our $B //= 10;

    my @data = map{ sprintf "%4.2f", rand() } 1 .. 10e6;

    print 'Add 1 and sort; add 1 and sort ...: ', timeit {
        my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
        for( my $i = $N; $i < $T; ++$i ) {
            @top = ( sort{ $b <=> $a } @top, $data[ $i ] )[ 0 .. $N-1 ];
        }
        # print "@top";
    };

    print "Add $B and sort; add $B and sort ...: ", timeit {
        my @top = sort{ $b <=> $a } @data[ 0 .. $N-1 ];
        for( my $i = $N; $i < $T; $i += $B ) {
            @top = sort{ $b <=> $a } @top, @data[ $i .. min( $i+$B, $#data ) ];
            $#top = $N;
        }
        # print "@top";
    };

    __END__
    C:\test>1052498-3 -T=10e6 -B=1e1
    Add 1 and sort; add 1 and sort ...: Wall:89.400528 secs; cpu:89.047000 secs
    Add 1e1 and sort; add 1e1 and sort ...: Wall:18.483192 secs; cpu:18.391000 secs

    C:\test>1052498-3 -T=10e6 -B=1e2
    Add 1 and sort; add 1 and sort ...: Wall:87.353784 secs; cpu:86.438000 secs
    Add 1e2 and sort; add 1e2 and sort ...: Wall:9.385142 secs; cpu:9.234000 secs

    C:\test>1052498-3 -T=10e6 -B=2e2
    Add 1 and sort; add 1 and sort ...: Wall:87.380 secs; cpu:87.046 secs
    Add 2e2 and sort; add 2e2 and sort ...: Wall: 9.135 secs; cpu: 9.156 secs

    C:\test>1052498-3 -T=10e6 -B=1e3
    Add 1 and sort; add 1 and sort ...: Wall:87.436786 secs; cpu:86.625000 secs
    Add 1e3 and sort; add 1e3 and sort ...: Wall:9.377329 secs; cpu:9.329000 secs

    C:\test>1052498-3 -T=10e6 -B=1e4
    Add 1 and sort; add 1 and sort ...: Wall:87.435 secs; cpu:86.298 secs
    Add 1e4 and sort; add 1e4 and sort ...: Wall:10.077 secs; cpu: 9.843 secs

    Now, if someone decides to try that where the data is greater than memory and so coming from an external file, they'll probably find that the IO-limited nature means that 1 or 1000 at a time make little or no difference to the overall elapsed time.

    But, if you then look at the CPU used, the 1 at a time will be an order of magnitude, or more, higher. And cycles cost money. There is little purpose in burning extra cycles to achieve the same outcome in the same time.



      Yes, but if you first compare the new item with the smallest of the 10 already stored, you will very rarely add it to the set of 10 and re-sort. This would be a "conditionally discard, add, and re-sort" algorithm.
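
      In code, that guard might look something like this sketch (assuming @top holds [ $score, $line ] pairs kept sorted ascending, so $top[0] is the current minimum):

      if ( $score > $top[0][0] ) {                      # new score beats the current minimum
          $top[0] = [ $score, $line ];                  # replace the minimum ...
          @top    = sort { $a->[0] <=> $b->[0] } @top;  # ... and re-sort the (tiny) array
      }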

        Yes. Your hand-coded insertion sort fares much better than those using the built-in sort.



        Indeed. But I'm still not at all convinced that the OP is actually after a simple top 10 from 715 million records.


