
Here is a one-pass solution to this problem using a combinatorial algorithm. One-pass means that you read the file only once, you need only O(gm) memory to print g words when the maximal word length is m, and you don't need to know the number of words in the dictionary in advance. Beyond that, I haven't tried to minimize computation time. That could easily be done too, while keeping the efficiency properties above: see Algorithm R in section 3.4.2 of Knuth. I leave the implementation as an exercise to the reader.

You didn't say whether there is any requirement on the order of the words printed, so I assume any order will do (whatever is simplest to implement). I also assume that if fewer than 100 words start with a given letter, we have to print all of them. The usual disclaimer applies to the code: I put it together quickly and it may contain errors.

As a simpler example, I first show how to select 100 words uniformly at random from a dictionary, regardless of their first letters.

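    use warnings;
    use strict;

    my $g = 100;    # sample size
    my @c;          # current sample
    my $n = 0;      # lines read so far

    while (<>) {
        # Accept line $n with probability $g/$n (always while $n <= $g).
        if (rand() < $g / ++$n) {
            # Insert at a random position; once the sample is full,
            # replace a random element instead.
            splice @c, int(rand(@c)), $g <= @c, $_;
        }
    }
    print for @c;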
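A short sketch of why the test rand() < $g / ++$n leaves every line in the sample with the same probability. The notation is mine: n is the number of lines read so far and g is the sample size. While n <= g every line is kept, so the claim holds trivially at n = g. For n > g it follows by induction:

```latex
% Line n is accepted with probability g/n and then overwrites one of the
% g slots chosen uniformly, so a line already in the sample is evicted
% with probability (g/n)(1/g) = 1/n.
\Pr[\text{line } n \text{ in sample after } n \text{ lines}] = \frac{g}{n},
\qquad
\Pr[\text{line } i < n \text{ in sample after } n \text{ lines}]
  = \frac{g}{n-1}\Bigl(1 - \frac{1}{n}\Bigr) = \frac{g}{n}.
```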
Doing this separately for every first letter, we get:
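    use warnings;
    use strict;

    my $g = 100;    # sample size per letter
    my %c;          # first letter => current sample (array ref)
    my %n;          # first letter => lines seen so far

    while (<>) {
        my $l = /(.)/ && lc($1);    # first character, lowercased
        my $c = \@{$c{$l}};         # this letter's sample (autovivified)
        if (rand() < $g / ++$n{$l}) {
            splice @$c, int(rand(@$c)), $g <= @$c, $_;
        }
    }
    print @$_ for values(%c);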
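For anyone who wants a head start on the exercise, here is a minimal sketch of the usual in-memory form of reservoir sampling in the spirit of Knuth's Algorithm R, applied per first letter. It draws one random number per line and overwrites a slot directly instead of calling splice. The sub names reservoir_offer and sample_by_letter are my own, the demo uses a tiny in-memory list rather than a dictionary file, and it assumes Perl 5.10 or later for the //= operator. Treat it as a sketch, not a tuned implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: offer one item to a reservoir of capacity $g.
# $seen is the number of items offered so far, including this one.
sub reservoir_offer {
    my ($res, $g, $seen, $item) = @_;
    if (@$res < $g) {
        push @$res, $item;               # reservoir not full yet: always keep
    }
    else {
        my $j = int(rand($seen));        # uniform in 0 .. $seen - 1
        $res->[$j] = $item if $j < $g;   # happens with probability $g / $seen
    }
    return;
}

# Hypothetical helper: sample up to $g lines for each first letter.
sub sample_by_letter {
    my ($g, @lines) = @_;
    my (%res, %seen);
    for my $line (@lines) {
        my $l = $line =~ /(.)/ ? lc $1 : '';
        reservoir_offer($res{$l} //= [], $g, ++$seen{$l}, $line);
    }
    return \%res;
}

# Small demo; with a real dictionary you would feed lines from <> instead.
my $demo = sample_by_letter(2,
    map { "$_\n" } qw(apple avocado apricot banana blueberry cherry));
print @{ $demo->{$_} } for sort keys %$demo;
```

The memory bound is unchanged (at most $g lines per letter), and letters with fewer than $g words still get all of their words printed.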

Update. Another one-pass solution would be to use heaps. You create a heap for each letter, add each word to the corresponding heap as you read it, using a random number as its priority, and pop an element whenever the heap grows larger than 100. I guess that this would be less CPU-efficient than a good implementation of the Knuth algorithm mentioned above.
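A rough sketch of the heap idea, assuming a hand-rolled binary max-heap of [priority, word] pairs so that the word with the largest random priority is the one evicted. The words that survive are those with the smallest priorities, which form a uniform random sample. All sub names are my own, the demo input is made up, and Perl 5.10 or later is assumed for //=. Each word costs O(log g) heap work.

```perl
use strict;
use warnings;

# Push a [priority, word] pair and sift it up to restore max-heap order.
sub heap_push {
    my ($h, $pair) = @_;
    push @$h, $pair;
    my $i = $#$h;
    while ($i > 0) {
        my $p = int(($i - 1) / 2);              # parent index
        last if $h->[$p][0] >= $h->[$i][0];
        @$h[$p, $i] = @$h[$i, $p];              # swap child with parent
        $i = $p;
    }
    return;
}

# Remove and return the pair with the largest priority.
sub heap_pop_max {
    my ($h) = @_;
    return unless @$h;
    my $top  = $h->[0];
    my $last = pop @$h;
    if (@$h) {
        $h->[0] = $last;
        my $i = 0;
        while (1) {
            my ($l, $r, $m) = (2 * $i + 1, 2 * $i + 2, $i);
            $m = $l if $l < @$h && $h->[$l][0] > $h->[$m][0];
            $m = $r if $r < @$h && $h->[$r][0] > $h->[$m][0];
            last if $m == $i;
            @$h[$m, $i] = @$h[$i, $m];          # swap with the larger child
            $i = $m;
        }
    }
    return $top;
}

# Keep, per first letter, the $g words with the smallest random priorities.
sub heap_sample_by_letter {
    my ($g, @lines) = @_;
    my %heap;
    for my $line (@lines) {
        my $l = $line =~ /(.)/ ? lc $1 : '';
        my $h = $heap{$l} //= [];
        heap_push($h, [rand(), $line]);
        heap_pop_max($h) if @$h > $g;           # evict the largest priority
    }
    my %out;
    for my $l (keys %heap) {
        $out{$l} = [ map { $_->[1] } @{ $heap{$l} } ];
    }
    return \%out;
}

# Small demo; with a real dictionary you would feed lines from <> instead.
my $picked = heap_sample_by_letter(2,
    map { "$_\n" } qw(apple avocado apricot ant banana cherry));
print @{ $picked->{$_} } for sort keys %$picked;
```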

Update 2008-10-09: see also Randomly select N lines from a file, on the fly.

Update 2009-12-26: see also Random sampling a variable record-length file, which (by the time you look there) should have some good solutions as well.


In reply to Re: improving the efficiency of a script (random sample) by ambrus
in thread improving the efficiency of a script by sulfericacid
