Monks and monkettes! I recently found myself wondering, what's the longest words in the dictionary (/usr/share/dict, anyway)?

This is easily found out, but it's natural to be interested not just in the longest word but (say) the top ten. And when your dictionary contains (say) eight words of length fifteen and six words of length fourteen, it's also natural to not want to arbitrarily select two of the latter, but list them all.

I quickly decided I needed a type of list that would have a concept of the fitness of an item (not necessarily the length of a word), and try not to exceed a maximum size if possible (while retaining some flexibility). My CPAN search-fu is non-existent, but since it sounded like fun, I just rolled my own. Here's the first stab at what is right now called List::LimitedSize::Fitness (if anyone's got a better idea for a name, please let me know):

# FIXME: name is preliminary package List::LimitedSize::Fitness; # standard preliminaries use strict; use warnings; # core modules use Carp; use List::Util qw/min max/; # we don't export anything, but Exporter also handles module versions. require Exporter; our @ISA = qw/Exporter/; our $VERSION = 1; ### CONSTRUCTOR ### # FIXME: should the policy be passed in as a string, or some other way +? sub new { my $class = shift; my $maximum_size = shift; my $policy = shift // "flexible"; # default to "flexible" # sanity check: sizes <= 0 don't make sense. croak "maximum size must be > 0" unless($maximum_size > 0); # sanity check: ensure a valid policy was passed. croak "invalid policy: $policy" unless($policy eq "strict" or $policy eq "flexible"); # create instance data my $self = { 'maximum-size' => $maximum_size, 'policy' => $policy, 'list' => {}, 'size' => 0, }; # bless instance data into class and return the new object bless $self, $class; return $self; } ### FITNESS ### # returns the worst fitness on the list. sub worst_fitness { my ($self) = @_; return min $self->fitnesses(); } # returns the best fitness on the list. sub best_fitness { my ($self) = @_; return max $self->fitnesses(); } # returns a list of all fitnesses on the list in list context, or the +number of fitnesses in scalar context. sub fitnesses { my ($self) = @_; return wantarray ? keys %{ $self->{'list'} } : scalar keys %{ $self->{'list'} }; } ### ITEMS ### # returns a list containing all items on the list. sub _allitems { my ($self) = @_; # list of all items my @allitems = (); # iterate through fitnesses and copy items. foreach my $fitness ($self->fitnesses()) { push @allitems, $self->items($fitness); } # return collected items. return @allitems; } # returns items on the list in list context, or the number of items on + the # list in scalar context, restricted to the given fitness if any. sub items { my ($self, $fitness) = @_; # don't do any work in void context return unless defined wantarray; # fitness specified? if(defined $fitness) { # if this fitness doesn't exist at all, return the empty list +or 0. unless(exists $self->{'list'}->{$fitness}) { return wantarray ? () : 0; } # fitness exists, return the right data return wantarray ? @{ $self->{'list'}->{$fitness} } : scalar @{ $self->{'list'}->{$fitness} }; } else { # no fitness specified, caller wants all items on list return wantarray ? $self->_allitems() : $self->{'size'}; } } ### ADDING ITEMS ### # add an item to the list with the given fitness # FIXME: should undefined items be forbidden? # FIXME: should we ensure $fitness is a number? sub add { my ($self, $item, $fitness) = @_; croak "Fitness undefined for item $item" unless defined $fitness; if($self->{'policy'} eq "flexible") { $self->_flexible_add($item, $fitness); } else { $self->_strict_add($item, $fitness); } } # add an item to the list with the given fitness, using the "strict" p +olicy. sub _strict_add { my ($self, $item, $fitness) = @_; # easy case: item fits on the list, just put it there. if($self->{'size'} < $self->{'maximum-size'}) { $self->_put($item, $fitness); # item does not fit on list. } else { my $worst_fitness = $self->worst_fitness(); if($fitness < $worst_fitness) { # if fitness is worse than current worst, do nothing. return; } elsif($fitness == $worst_fitness) { # if fitness matchs current worst, do nothing. return; } else { # if fitness exceeds current worst, add item... $self->_put($item, $fitness); # ...and delete one item from worst fitness class. $self->_pop($worst_fitness); } } } # add an item to the list with the given fitness, using the "flexible" + policy. sub _flexible_add { my ($self, $item, $fitness) = @_; # easy case: item fits on the list, just put it there. if($self->{'size'} < $self->{'maximum-size'}) { $self->_put($item, $fitness); # item does not fit on list. } else { my $worst_fitness = $self->worst_fitness(); if($fitness < $worst_fitness) { # if fitness is worse than current worst, do nothing. return; } elsif($fitness == $worst_fitness) { # if fitness matchs current worst, simply add item. $self->_put($item, $fitness); } else { # if fitness exceeds current worst, add item... $self->_put($item, $fitness); # ...and delete worst fitness class if we safely can -- i. +e. if we # can be sure that doing so doesn't open us up to the # possibility of new members being added to it later on. my $num_worst = $self->items($worst_fitness); if($self->{'size'} - $num_worst >= $self->{'maximum-size'} +) { $self->{'size'} -= $self->items($worst_fitness); delete $self->{'list'}->{$worst_fitness}; } } } } # put an item on the list without any further checks. sub _put { my ($self, $item, $fitness) = @_; push @{ $self->{'list'}->{$fitness} }, $item; $self->{'size'}++; } ### REMOVING ITEMS ### # Note: there is no remove(), since removal of items from the list by +the # user is explicitely not supported. (For now, unless/until someone # provides a use case, and ideally a patch.) # remove an item with the given fitness from the list (and return it). # No guarantees as to which item you'll get. sub _pop { my ($self, $fitness) = @_; # sanity check: fitness class must exist and not be empty return unless exists $self->{'list'}->{$fitness}; return unless scalar $self->items($fitness); # remove and remember item my $item = pop @{ $self->{'list'}->{$fitness} }; # adjust list size $self->{'size'}--; # if the class the item came from is empty now, remove it entirely delete $self->{'list'}->{$fitness} unless $self->items($fitness); # return removed item return $item; } # obligatory success! 1;

This features both "flexible" and "strict" policies. With the former, fitness classes are guaranteed to never lose items, but the list as a whole might grow beyond the specified maximum size. With the latter, the list is guaranteed to never grow beyond the specified maximum size, but fitness classes might lose items. (Obviously you cannot have it both ways, not in general.)

Here's an example of the whole thing in action:

#!/usr/bin/perl use Modern::Perl '2014'; use English; # make sure STD* uses UTF-8 use open IO => ':encoding(UTF-8)', ':std'; use FindBin; use lib $FindBin::Bin; use List::LimitedSize::Fitness; ### MAIN ### $OUTPUT_AUTOFLUSH = 1; my $num_words = 10; my $list = List::LimitedSize::Fitness->new($num_words, "flexible" +); while(<>) { chomp; print "." unless $INPUT_LINE_NUMBER % 10000; $list->add($_, length $_); } say ""; foreach my $fitness (sort { $a <=> $b } $list->fitnesses()) { say "length $fitness"; say join "\n", map { s/_/ /g; "\t" . $_ } $list->items($fitness); } say $list->items() . " words total ($num_words requested).";

This might output (depending on your dictionary):

$ perl longestwords.pl wordsEn.txt .......... length 21 antienvironmentalists antiinstitutionalists counterclassification electroencephalograms electroencephalograph electrotheraputically gastroenterologically internationalizations mechanotheraputically microminiaturizations microradiographically length 22 counterclassifications counterrevolutionaries electroencephalographs electroencephalography length 23 disestablismentarianism electroencephalographic length 25 antidisestablishmentarian length 28 antidisestablishmentarianism 19 words total (10 requested). $

If you've got any thoughts, tips, comments, rotten tomatoes etc., send them my way! (...actually, forget about the rotten tomatoes.)

Also, does anyone think this module would be useful to have on CPAN, in principle if not in its current state?

Replies are listed 'Best First'.
Re: Size-limited, fitness-based lists
by jdporter (Canon) on Aug 09, 2015 at 04:00 UTC

    This is when a little schooling comes in handy. ;-)

    What you want is a Heap.

    One nice module for heaps is Heap::Simple. Et voil:

    use Heap::Simple; use strict; use warnings; # NB: will throw an exception for a variety of reasons, including # if the source produces fewer lines than the 'min' specified. sub get_N_longest { my( $min, $infile ) = @_; open my $infh, '<', $infile or die; my $heap = Heap::Simple->new( elements => 'Any', order => '>' ); local $_; # be nice to the caller. while ( <$infh> ) { chomp; $heap->key_insert( length($_), $_ ); } my @r; push @r, $heap->extract_upto( length $heap->top ) # top throws if +heap empty while @r < $min; @r } printf "%6d %s\n", length($_), $_ for get_N_longest(10,'words.txt');
    I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.

      Ah, thank you! Heaps sound like a useful data structure for this problem. (It really shows that I don't have a CS background; doesn't it? I don't know the lingo, or even many of the concepts.)

      That said -- wouldn't a heap grow to include all items first before you extract the relevant ones? Depending on how large your data is and how much memory you can or cannot afford to throw at it, I can see advantages to my approach, too.

      Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.

        Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.

        Nice defeatist attitude. I strongly suspect that the test failures have almost nothing to do with the functionality of the module and that if you simply "force" the install, that Heap::Simple will work fine and be quite useable for you.

        Two minutes of investigation certainly reinforces this impression for me. The complexity of getting the unit tests to run correctly is much greater than and very different from the complexity of making the module functional, especially in this case.

        Update: And a quick glance at the test results leads me to suspect that just using version 0.11 would even lead to the unit tests just working. Update2: And a look at the Changes shows that there are likely no feature differences between 0.11 and the latest version.

        - tye        

        wouldn't a heap grow to include all items first before you extract the relevant ones

        Yes, but I don't see a way to avoid that and yet satisfy your requirement to include all of the members of a class even if that causes the "minimum" to be exceeded. For example, what do you think we should do if the input list consists of a million words all of exactly the same length? If the "size-limited" aspect of the need is strong enough to make us discard some of that set of values, then you need to specify your "cut-off" heuristics somehow.

        I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
Re: Size-limited, fitness-based lists
by toolic (Bishop) on Aug 09, 2015 at 01:00 UTC
    This reminds me of the time long ago that I tried to craft a maximum single play for Scrabble. The width of the board is 15 letters, and I pretended the 3 triple-word squares were empty and I was able to use all 7 of my tiles. I think the word I came up with was "aggrandizements", but I never got around to using a computer to maximize this.
Re: Size-limited, fitness-based lists
by BrowserUk (Pope) on Aug 10, 2015 at 00:52 UTC

    Have you considered something like this for your application?:

    #! perl -slw use strict; sub bestN { my( $cmp, $n, $ref ) = @_; my $src = $ref; if( ref $ref eq 'ARRAY' ) { my $t = 0; $src = sub{ $t < @{ $ref } ? $ref->[ $t++ ] : undef }; } my $_cmp = sub { local( $::a, $::b ) = @_; $cmp->() }; my @top = sort $cmp map $src->(), 1 .. $n; while( defined( $_ = $src->() ) ) { next if $_cmp->( $top[ $#top ], $_ ) <= 0; my $p = $#top; while( $p > 0 ) { $_cmp->( $top[ --$p ], $_ ) <= 0 and last; } splice @top, $p, 0, $_; pop @top; } return @top; } my @in = 1 .. 100; print for bestN sub{ $::b <=> $::a }, 10, \@in; open DICT, '<', $ARGV[0] or die $!; print for bestN sub{ length( $::b ) <=> length( $::a ) }, 10, sub{ loc +al $^W; chomp( $_ = <DICT> ); $_ };;

    Output:

    C:\test>topN words.txt 100 99 98 97 96 95 94 93 92 91 abstractionisms abstractionists acanthocephalan acceptabilities acceptingnesses accessibilities acclimatization accommodatingly accommodational absorbabilities

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority". I knew I was on the right track :)
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!