 
PerlMonks  


Yes, it is indeed called a "random sample". You've got the right keyword, so a search would have turned up this excellent past thread: improving the efficiency of a script.

(Short answer, because my reply there isn't clear: if you need a sample of k records, take an array holding k records and initialize it with the first k records of your file. Then read the rest of the file sequentially; for each record, if its one-based index in the file is n, roll a die with n sides, and if it lands on one of the first k sides, replace the array element at that index with the record.)

Update: sorry, the procedure as first posted was wrong (it said zero-based). The die must have as many sides as the one-based index of the record in the file.
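Why the number of sides has to be the one-based index can be checked with a short induction. The sketch below only verifies the marginal inclusion probability (each record ends up in the array with probability k/n once n records have been read); it is not a full proof that every k-element subset is equally likely.

```latex
% Claim: after n >= k records have been read, every record seen so far
% is in the array with probability k/n.
\begin{align*}
\text{Base case } (n = k):\quad
  & P(\text{record } i \text{ in array}) = 1 = \tfrac{k}{k}. \\
\text{New record } n > k:\quad
  & P(\text{record } n \text{ enters}) = \tfrac{k}{n}. \\
\text{Old record } i < n:\quad
  & P(\text{record } i \text{ in array after step } n)
    = \tfrac{k}{n-1}\left(1 - \tfrac{k}{n}\cdot\tfrac{1}{k}\right)
    = \tfrac{k}{n-1}\cdot\tfrac{n-1}{n}
    = \tfrac{k}{n}.
\end{align*}
```

With a zero-based index the (k+1)-th record would roll a k-sided die and always be kept, instead of being kept with probability k/(k+1), so the induction step fails.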

To make this clearer, here's some code. Records are one per line, and the first command line argument is the number of samples you need. I assumed throughout this node that you want samples without repetition and that the order of the samples doesn't matter. (Before you ask, yes, I do know about $. and even use it sometimes.)

perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 filename
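Spelled out as a subroutine, the one-liner above might look like the sketch below. The name reservoir_sample, its (k, filehandle) interface and the in-memory demo file are assumptions made for illustration; the sampling step itself is the same as in the one-liner, with the fractional index truncated explicitly by int.

```perl
use strict;
use warnings;

# reservoir_sample($k, $fh): return an array ref holding up to $k lines
# read from filehandle $fh. (Hypothetical helper name, for illustration.)
sub reservoir_sample {
    my ($k, $fh) = @_;
    my @sample;
    my $seen = 0;    # records read so far = one-based index of the current one
    while (my $line = <$fh>) {
        $seen++;
        if (@sample < $k) {
            # The first $k records fill the array directly.
            push @sample, $line;
        }
        else {
            # Roll a die with $seen sides: an integer in 0 .. $seen-1.
            my $j = int rand $seen;
            # Keep the record only if the roll is one of the first $k sides.
            $sample[$j] = $line if $j < $k;
        }
    }
    return \@sample;
}

# Demo on an in-memory file, so the sketch runs without any input file.
my $data = join '', map { "$_\n" } qw(one two three four five six);
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $picked = reservoir_sample(3, $fh);
close $fh;
print @$picked;    # three of the six lines, in no particular order
```

If the file has fewer than k records, the sketch simply returns all of them.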

Update: It's easy to make an error in this kind of thing, so you have to test it. The run below shows that you get all 20 possible choices of 3 out of 6 with approximately equal frequency, so we can hope it's a truly uniform random choice.

$ cat a
one
two
three
four
five
six
$ (for x in {1..33333}; do perl -we 'my $k = int(shift); my @a; my $l = 0; while (<>) { if ($l++ < $k) { push @a, $_ } else { if ((my $j = rand($l)) < $k) { $a[$j] = $_; } } } print @a;' 3 a | sort | tr \\n \ ; echo; done) | sort | uniq -c | sort -rn
   1747 five four three
   1736 five one six
   1735 five three two
   1725 four three two
   1707 one six three
   1695 five four two
   1685 five six three
   1684 five six two
   1678 one three two
   1666 five four six
   1663 four six two
   1663 four one six
   1663 five four one
   1645 four one two
   1640 four six three
   1637 five one three
   1616 five one two
   1592 six three two
   1578 one six two
   1578 four one three
$
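The same frequency check can also be run inside a single Perl process instead of starting perl 33333 times. This is only a sketch: sample_k is a hypothetical helper that applies the sampling step from the one-liner to a list rather than to a file, and the trial count is copied from the shell run above.

```perl
use strict;
use warnings;

# Same sampling step as the one-liner, applied to a list instead of a file.
# (sample_k is a hypothetical helper name for this sketch.)
sub sample_k {
    my ($k, @records) = @_;
    my @a;
    my $l = 0;
    for my $rec (@records) {
        if ($l++ < $k) {
            push @a, $rec;
        }
        else {
            my $j = int rand $l;          # die with $l sides: 0 .. $l-1
            $a[$j] = $rec if $j < $k;
        }
    }
    return @a;
}

my @records = qw(one two three four five six);
my $trials  = 33333;
my %count;
for (1 .. $trials) {
    my @picked = sample_k(3, @records);
    # Sort so that the same set of records always produces the same key.
    my $key = join ' ', sort @picked;
    $count{$key}++;
}

# Print combinations by descending frequency, like sort | uniq -c | sort -rn.
for my $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
    printf "%7d %s\n", $count{$key}, $key;
}
printf "%d distinct combinations, expected about %.0f each\n",
    scalar(keys %count), $trials / 20;
```

A roughly flat table is still only evidence of uniformity, not proof; a chi-squared test on the counts would make the comparison more rigorous.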

In reply to Re: Random sampling a variable length file. by ambrus
in thread Random sampling a variable record-length file. by BrowserUk
