http://www.perlmonks.org?node_id=988127


in reply to Re^2: Get random unique lines from file
in thread Get random unique lines from file

BrowserUk:

Ah, well, I know jack about FASTA files, so I didn't consider that. Of course, by changing the reader to accumulate records instead of lines, it could be adapted. Though since there are already a couple working examples from you and Marshall, and since mine has a bias in it, there's no real reason to do so.

I know that *you* know how to make the changes, but if someone tripping across this node in the future wants to do it, they could do something (untested!) like this:

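    my @record;
    while (1) {
        my $line = <$FH>;
        # A new marker (or end of file) completes the record accumulated so far
        if (@record and (!defined $line or $line =~ /start of record marker/)) {
            ++$cnt_recs;
            if ($num/$cnt_recs > rand) {
                my $i = @samples;
                # Once @samples is full, overwrite a random slot instead
                if ($i >= $num) {
                    $i = rand @samples;
                }
                $samples[$i] = [$cnt_recs, [@record]];
            }
            @record = ();
        }
        last unless defined $line;
        # Accumulate record (the marker line starts the new one)
        push @record, $line;
    }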

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^4: Get random unique lines from file
by Marshall (Canon) on Aug 20, 2012 at 05:16 UTC
    I don't think that this is the best way...
    BrowserUk and I both used the core module: List::Util::shuffle;

    He understood the FASTA format better than I did and that is fine given that the format of the OP's question was hard to "decode".

    The main point is that this core shuffle() function works very well, is very fast (a core function that is implemented in 'C') and has an interface that is easy to understand. I recommend using it rather than trying to "roll your own".
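
For anyone reading along later, here is a minimal sketch of the shuffle() approach. The helper name pick_random and the stand-in data are assumptions made up for illustration, not code from this thread. It also assumes every record fits in memory at once, since shuffle() needs the whole list.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper (not from the thread): return $num distinct
# records chosen at random from the array referenced by $records.
sub pick_random {
    my ($num, $records) = @_;
    $num = @$records if $num > @$records;   # cannot pick more than exist
    return () if $num < 1;
    my @shuffled = shuffle @$records;       # every ordering is equally likely
    return @shuffled[0 .. $num - 1];        # the first $num are the sample
}

my @lines  = map { "line $_\n" } 1 .. 10;   # stand-in for lines read from a file
my @picked = pick_random(3, \@lines);
print @picked;
```

Because the picks are distinct positions in the shuffled list, the same record can never be returned twice, which is the "unique" part of the original question.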

    Oh, BTW, "core" means that this is available on all Perl systems as part of the standard distribution - no "extra module installation" is required. .... Well, I don't know exactly about "all", but I figure since Perl 5.8 (for about a decade).

      Marshall:

      I wasn't really worried whether it was the best way or not, nor whether it used modules or not. I was just amused by the technique for getting a single random line from a file with equal probability, and wanted to generalize it so I could use it for multiple lines.
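
The single-line technique mentioned above can be sketched roughly like this. The sub name random_line and the in-memory file are assumptions for the demo. The idea is that line number N replaces the current choice with probability 1/N, which works out to every line having the same chance of being the final pick.

```perl
use strict;
use warnings;

# Hypothetical helper: pick one line from a filehandle, each line with
# equal probability, without holding the whole file in memory.
sub random_line {
    my ($fh) = @_;
    my $count = 0;
    my $chosen;
    while (my $line = <$fh>) {
        ++$count;
        # Replace the current choice with probability 1/$count
        $chosen = $line if rand($count) < 1;
    }
    return $chosen;   # undef if the file was empty
}

my $data = "a\nb\nc\n";                        # stand-in for a real file
open my $fh, '<', \$data or die "open: $!";    # in-memory filehandle
print random_line($fh);
```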

      Unfortunately, I haven't come up with any ideas that don't introduce a bias. (I haven't thought about it really hard for the last few days, but I've given it occasional thought during my daily commutes.)
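
One standard way to generalize this to several lines is reservoir sampling (Algorithm R). The sketch below is offered under assumptions: the sub name reservoir_sample and the demo data are invented for illustration and this is not the code discussed earlier in the thread. The first $num lines fill the reservoir; after that, line number N lands in a uniformly random slot with probability $num/N.

```perl
use strict;
use warnings;

# Hypothetical helper: reservoir sampling (Algorithm R). Keeps $num
# lines from a stream of unknown length, each line equally likely.
sub reservoir_sample {
    my ($num, $fh) = @_;
    my @samples;
    my $count = 0;
    return @samples if $num < 1;
    while (my $line = <$fh>) {
        ++$count;
        if (@samples < $num) {
            push @samples, $line;              # fill the reservoir first
        }
        else {
            my $j = int(rand($count));         # uniform in 0 .. $count-1
            $samples[$j] = $line if $j < $num; # happens with probability $num/$count
        }
    }
    return @samples;
}

my $data = join '', map { "line $_\n" } 1 .. 10;   # stand-in for a real file
open my $fh, '<', \$data or die "open: $!";
print reservoir_sample(3, $fh);
```

Each input line is stored at most once, so the sample never contains the same line position twice.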

      I might be able to come up with something if I would sit down and analyze the probabilities, but it's not quite interesting enough to work *that* hard on it! ;^)

      ...roboticus


        Unless you are trying to game the tables in "Lost Wages" or do scientific calculations, this doesn't matter at all.
        Don't worry about it if it doesn't matter.