Beefy Boxes and Bandwidth Generously Provided by pair Networks
go ahead... be a heretic
 
PerlMonks  

Re: Size-limited, fitness-based lists

by jdporter (Paladin)
on Aug 09, 2015 at 04:00 UTC ( [id://1137950]=note: print w/replies, xml ) Need Help??


in reply to Size-limited, fitness-based lists

This is when a little schooling comes in handy. ;-)

What you want is a Heap.

One nice module for heaps is Heap::Simple. Et voilą:

use Heap::Simple; use strict; use warnings; # NB: will throw an exception for a variety of reasons, including # if the source produces fewer lines than the 'min' specified. sub get_N_longest { my( $min, $infile ) = @_; open my $infh, '<', $infile or die; my $heap = Heap::Simple->new( elements => 'Any', order => '>' ); local $_; # be nice to the caller. while ( <$infh> ) { chomp; $heap->key_insert( length($_), $_ ); } my @r; push @r, $heap->extract_upto( length $heap->top ) # top throws if +heap empty while @r < $min; @r } printf "%6d %s\n", length($_), $_ for get_N_longest(10,'words.txt');
I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.

Replies are listed 'Best First'.
Re^2: Size-limited, fitness-based lists
by AppleFritter (Vicar) on Aug 09, 2015 at 09:39 UTC

    Ah, thank you! Heaps sound like a useful data structure for this problem. (It really shows that I don't have a CS background; doesn't it? I don't know the lingo, or even many of the concepts.)

    That said -- wouldn't a heap grow to include all items first before you extract the relevant ones? Depending on how large your data is and how much memory you can or cannot afford to throw at it, I can see advantages to my approach, too.

    Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.

      Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.

      Nice defeatist attitude. I strongly suspect that the test failures have almost nothing to do with the functionality of the module and that if you simply "force" the install, that Heap::Simple will work fine and be quite useable for you.

      Two minutes of investigation certainly reinforces this impression for me. The complexity of getting the unit tests to run correctly is much greater than and very different from the complexity of making the module functional, especially in this case.

      Update: And a quick glance at the test results leads me to suspect that just using version 0.11 would even lead to the unit tests just working. Update2: And a look at the Changes shows that there are likely no feature differences between 0.11 and the latest version.

      - tye        

        There's defeatist and then there's beating a dead horse. I'm curious what told you that the test results [c|sh]ould be ignored in this case?

        Dum Spiro Spero
      wouldn't a heap grow to include all items first before you extract the relevant ones

      Yes, but I don't see a way to avoid that and yet satisfy your requirement to include all of the members of a class even if that causes the "minimum" to be exceeded. For example, what do you think we should do if the input list consists of a million words all of exactly the same length? If the "size-limited" aspect of the need is strong enough to make us discard some of that set of values, then you need to specify your "cut-off" heuristics somehow.

      I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
        wouldn't a heap grow to include all items first before you extract the relevant ones
        Yes

        In general, no. The reason I use a heap for "top 10 of a huge list" is because you can get that result from a heap that never gets above a size of 10. If you allow for ties, then you only have to let the heap grow beyond 10 based on how many items are currently tied for 10th place, which usually means that the heap only ever gets slightly longer than 10 items.

        Start with an empty heap configured to keep the worst item at the head. For each item, if there are fewer than 10 items in the heap, then push the next item in. Else, if the item is worse than the item at the head of the heap, then simply discard the item. Else, if the new item is tied with the worst item, then push it onto a separate stack of tied items. Else, the new item is better than the worst item so replace the worst item with the new item and push it down, then compare the new worst item (the new head of the heap) with the item you just removed. If the new head is better than the old head, then discard the old head and empty the list of tied items. Otherwise, push the old head onto the list of tied items.

        - tye        

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1137950]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (6)
As of 2024-04-16 17:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found