
Update: The target of your "This algorithm" link has changed (presumably following Consideration). The content it now links to is completely different from what it linked to previously. Ignore any of my comments that no longer make sense (I think I've stricken most, if not all, of them). An Update in your post would have been useful!

"Will cause the entire file to be read through a tiny buffer."

The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer"? It can be changed with the memory option: what size would you suggest?
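For reference, the memory option is supplied when the file is tied. The following is only a minimal sketch: the 20MB limit is an arbitrary value chosen for illustration, and a throwaway temporary file stands in for the real data.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Build a small throwaway file so the sketch runs as-is
# (an assumption for illustration; not the OP's data).
my ($fh, $file) = tempfile(UNLINK => 1);
print {$fh} "record $_\n" for 1 .. 1000;
close $fh or die "Cannot close '$file': $!";

# memory => N sets the read cache limit in bytes.
# 20_000_000 is an arbitrary illustrative value, not a recommendation.
tie my @lines, 'Tie::File', $file, memory => 20_000_000
    or die "Cannot tie '$file': $!";

my $count = scalar @lines;   # counting records scans the whole file
my $last  = $lines[-1];      # records are returned without the line ending

untie @lines;

print "$count records; last is '$last'\n";
```

Whether a larger cache actually helps will depend on the access pattern, so it's worth timing with a few different values.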

"... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

~~I suspect this is a reference to your "This algorithm" link: see the next point.~~

"~~This algorithm~~ [Link removed - see update above] will fairly pick random lines from using a single pass over that file."

That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.

[By the way, the URL associated with that link (i.e. http://docstore.mik.ua/orelly/perl/cookbook/ch08_07.htm) is questionable: it's O'Reilly material provided by someone else. I've seen arguments for and against this specific one, so I'm just pointing it out.]
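To make the distinction concrete, here's a minimal sketch of the single-pass technique for picking one line with equal probability. It's written from memory of the general idea rather than copied from the recipe, and the in-memory input is made up for illustration. Note that it yields exactly one line per pass, which is why it doesn't fit the OP's requirement.

```perl
use strict;
use warnings;

# Made-up input held in memory so the sketch is self-contained.
my $data = join '', map { "line $_\n" } 1 .. 10;
open my $in, '<', \$data or die "Cannot open in-memory file: $!";

# Single pass: line number N replaces the current choice with
# probability 1/N, so every line ends up equally likely to be kept.
my $chosen;
while (my $line = <$in>) {
    $chosen = $line if rand($.) < 1;
}
close $in;

chomp $chosen;
print "Picked: $chosen\n";
```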

I have a 10GB test file (100,000,000 records of 100 bytes each) which I used to test my solution. Not surprisingly, this data bore no resemblance to the OP's data, so I added some code to create dummy chromosome and position values:

    #my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
    my ($chr, $pos) = (split '', $locations[$indexes[$rand_index]])[0, 1];
    $chr = 'chromosome' . ($_ % 4 + 1);
    $pos = $_ + 10;

I also changed the subset size from my original test value of 4 to the OP's example of 500.

With twice the expected file size and additional processing, this took almost exactly 40 minutes.

I tried another solution (without Tie::File) using tell to create the index and seek to locate the wanted random records: beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.
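The tell/seek approach can be sketched roughly as follows. This is not the code I timed: it's a cut-down illustration that assumes a small temporary file of made-up records, and the variable names are invented for the example.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);
use File::Temp qw(tempfile);

# Made-up data file (100 records) so the sketch runs as-is.
my ($tmp, $file) = tempfile(UNLINK => 1);
print {$tmp} 'chromosome' . ($_ % 4 + 1) . ' ' . ($_ + 10) . "\n" for 0 .. 99;
close $tmp or die "Cannot close '$file': $!";

open my $in, '<', $file or die "Cannot open '$file': $!";

# Pass 1: use tell to record the byte offset at which each record starts.
my @offsets;
my $pos = tell $in;
while (defined(my $line = <$in>)) {
    push @offsets, $pos;
    $pos = tell $in;
}

# Pass 2: visit the records in random order, using seek to jump
# straight to each stored offset.
my @picked;
for my $offset (shuffle @offsets) {
    seek $in, $offset, 0 or die "Cannot seek to $offset: $!";
    my $record = <$in>;
    chomp $record;
    push @picked, $record;
}
close $in;

print scalar(@picked), " records read in random order\n";
```

Only the offsets are held in memory, so the index stays small relative to the file itself.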

While other solutions may be faster, I don't consider this to be "horribly slow" or exhibiting "abysmal" performance.

-- Ken

In reply to Re^3: Building a new file by filtering a randomized old file on two fields by kcott
in thread Building a new file by filtering a randomized old file on two fields by mbp
