Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

comment on

( #3333=superdoc: print w/replies, xml ) Need Help??

This problem was originally presented to me by a co-worker as such,

I have a set of 100 files, and I want to randomly choose 5 of them. However, I want to weight the selection of each file based upon the number of words in the file. More words == greater chance it will be randomly selected. The files could contain between 100 - 3,000 words. What's a good way to do this?

First of all, the approach I'm going to detail here is specific to this example, but can easily be adapted to any random selection of weighted values. It should scale very easily to very large data sets (number of files, in this case), with very large weighting information (number of words, in this case). This post is mostly pseudo-code and explanation, not a functional piece of code. This solution has probably been created by other people before.

Okay, for sake of example, we're going to start off with 5 files.

my @files = qw(file_a.txt file_b.txt file_c.txt file_d.txt file_e.txt) +;

And we want to randomly choose 2 of the files. But, we want to weight our selections based off of the number of words in the file. More words in a file == more likely the file will be chosen.

Your first thought might be to build an array of all of the words in all of the files, then pick a random index, and determine which file the word is in. Note - you would need to pre-cache which word at which index is associate with which file. For example, the word "the" could appear at file_a.txt or file_b.txt. So you can't just randomly choose index 3, see "the" there, and know which file it's in. You have to know that index 3 => the => file_a.txt.

This is the first optimization. The words don't matter, only which file they're in. So instead of storing the word at each point, you can just store the file name. At this point, you'll need to be able to count the number of words in each file to get your weighting information. This is left as an exercise to the reader - use your favorite word counting widget. For example's sake, we'll say you end up with this structure:

my %words_in_files = ( 'file_a.txt' => 10, 'file_b.txt' => 1, 'file_c.txt' => 3, 'file_d.txt' => 5, 'file_e.txt' => 10 );

Now you can build up an array where the first 10 elements are "file_a.txt", the next one is "file_b.txt" and so on. For simplicity's sake, we'll display each file as its trailing letter ("file_a.txt" becomes "a"). This way, we can see our data:

aaaaaaaaaabcccdddddeeeeeeeeee

This approach is fully functional, but doesn't scale well. In our original problem, we had 100 files, with up to 3,000 words each. This is potentially a 300,000 element array, and it's just gonna get bigger if you add more files or words within the files. Don't get me wrong - perl can do it, but there's a better way.

The key is to realize that most of the information in that array is redundant. We're storing "a" 10 times. Do we really need to? Instead, we'll build a different data structure. In this structure, we'll store the file name, and the index at which the file begins. Externally, we'll also store a count of the total number of words. We end up with a structure along these lines:

my @indexes_of_files = ( # terminology: index 0 == "file offset, index 1 == "file name" [ qw( 0 file_a.txt ) ], [ qw( 10 file_b.txt ) ], [ qw( 11 file_c.txt ) ], [ qw( 14 file_d.txt ) ], [ qw( 19 file_e.txt ) ], ); my $total_number_of_words = 29;

Feel free to use hashrefs instead of arrayrefs, they may be easier to work with. I used arrayrefs here for simplicity of code display in the example. Note that the order of the files in this data structure is arbitrary. Whatever order the files are assigned in this array is irrelevant, so long as their file offsets change as appropriate.

We now have a much more compact data structure. Our algorithm is easy - generate a random integer between 0 and $total_number_of_words - 1 (0..28, in this case). Let's say that we generated "15".

Next, you need to search through the @indexes_of_files array to find the greatest file offset that's less than our generated number. Since the file information is in sorted order, a binary search can zip through the data in no time. Implementing the binary search (or whatever) algorithm is another exercise left to the reader.

However you find your data, you'll discover that you're looking at array element 4, which has file offset 14, corresponding to "file_d.txt". Note that if you count off 15 ticks into the "aaaa...b...cc..." array drawn out above, you'll also reach a "d", corresponding to file_d.txt.

You have now successfully chosen your first file, so you need to set up for subsequent ones. This is a 3 step op. One is easy, one is expensive, and one is tedious. First, the easy step.

Subtract from $total_number_of_words the length of the file you just chosen. In this case, file_d.txt has a length of 5 words, so $total_number_of_words becomes 24. You can re-calculate this length now using the the index of the element you were at and the index of the next element, you can cache it into the data structure, you can look up in the hash created earlier. Dealer's choice. But you need the length.

The "expensive" operation is simply to splice out the element at index [4].

Finally, for all elements >= the one you've removed ([4]), subtract from their file offset the length of the file just removed (5, in this case). You'll end up with this data structure when you're done:

my @indexes_of_files = ( # terminology: index 0 == "file offset, index 1 == "file name" [ qw( 0 file_a.txt ) ], [ qw( 10 file_b.txt ) ], [ qw( 11 file_c.txt ) ], # THIS FILE WAS REMOVED [ qw( 14 file_d.txt ) ], [ qw( 14 file_e.txt ) ], #this file offset was 19, is now 14. ); my $total_number_of_words = 24; #previously 29

And blam-o, you're set up to choose your next file, it's as if file_d.txt never existed, and you can repeat ad infinitum until you've selected enough files out.

Considerations

  • With tremendously long lists of data (for this example, say thousands or millions of files), you need to do a splice on a large array, and then run through all higher elements to do a subtraction. You're just iterating over an array and doing a subtraction on an integer, but it's still O(n). There may be fancier ways to do this w/o splicing or changing offsets, but I was unable to come up with an elegant one. They all seemed fragile and complicated relative to just decrementing the indexes. YMMV.
  • This can be applied to any set of data that you need to randomly select a value from based on a weighting value.
  • For smaller data sets, it may be simpler to just use the "aaa...b...cc..." approach of an array that lists all the file names. But it doesn't scale as well.
  • There may be something that does this efficiently already on CPAN.

In reply to Efficiently selecting a random, weighted element by jimt

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others taking refuge in the Monastery: (6)
    As of 2019-11-22 08:34 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?
      Strict and warnings: which comes first?



      Results (110 votes). Check out past polls.

      Notices?