
Re^2: Building a new file by filtering a randomized old file on two fields

by BrowserUk (Pope)
on Apr 30, 2014 at 11:33 UTC (#1084481)


in reply to Re: Building a new file by filtering a randomized old file on two fields
in thread Building a new file by filtering a randomized old file on two fields

"To get around potential memory issues (due to 4-5 Gb files), you can use Tie::File. This will not load the entire file into memory."

That code will be horribly slow.

This single line:

my $last_index = $#locations;

Will cause the entire file to be read through a tiny buffer.

And this line:

my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];

will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records.

Many parts of the file will be read and re-read many times. Performance will be abysmal.

This algorithm will fairly pick random lines from a file using a single pass over that file.
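
For illustration only, here is a minimal sketch of one way to pick several lines fairly in a single pass: reservoir sampling. This is an assumption about the kind of algorithm meant, not necessarily what the linked node contains; the sub name sample_lines and its parameters are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return up to $k lines chosen uniformly at random
# from the file at $path, reading the file exactly once.
sub sample_lines {
    my ($path, $k) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my @reservoir;
    while (my $line = <$fh>) {
        if (@reservoir < $k) {
            # The first $k lines fill the reservoir unconditionally.
            push @reservoir, $line;
        }
        else {
            # $. is the number of lines read so far. Pick a slot in
            # 0 .. $. - 1; the new line is kept only if the slot falls
            # inside the reservoir, i.e. with probability $k / $.
            my $j = int(rand($.));
            $reservoir[$j] = $line if $j < $k;
        }
    }
    close $fh;
    return @reservoir;
}
```

Memory use is bounded by the $k lines held in the reservoir, and the file is read sequentially, so no seeking is involved. The sample is not returned in random order; shuffle it afterwards if the order matters.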


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^3: Building a new file by filtering a randomized old file on two fields
by kcott (Abbot) on Apr 30, 2014 at 20:04 UTC

    Update: Your "This algorithm" link target has changed (presumably following Consideration). The content it now links to is completely different from the previous content linked to. Ignore any comments I've made which no longer make any sense (I think I've stricken most, if not all, of them). An Update in your post would have been useful!

    "Will cause the entire file to be read through a tiny buffer."

    The default read cache size is 2MiB. Is this what you're referring to as "a tiny buffer"? It can be changed with the memory option: what size would you suggest?
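
    As a sketch of how the memory option is passed to Tie::File (the helper name tie_with_cache, the file name and the 20,000,000 byte figure are placeholders for illustration, not recommendations):

```perl
use strict;
use warnings;
use Fcntl 'O_RDONLY';
use Tie::File;

# Hypothetical helper: tie $path read-only with a read cache limit of
# $cache_bytes and return a reference to the tied array.
sub tie_with_cache {
    my ($path, $cache_bytes) = @_;
    tie my @lines, 'Tie::File', $path,
        mode   => O_RDONLY,
        memory => $cache_bytes
        or die "Cannot tie $path: $!";
    return \@lines;
}

# Example call (placeholder file name):
# my $locations = tie_with_cache('locations.txt', 20_000_000);
```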

    "... will require the disk heads to shuffle back and forth all over the disk to locate the randomly chosen records."

    I suspect this is a reference to your "This algorithm" link: see the next point.

    "This algorithm [Link removed - see update above] will fairly pick random lines from using a single pass over that file."

    That links to "Perl Cookbook: Recipe 8.6. Picking a Random Line from a File". The OP wants to pick each line at random, not just one line at random.

    [By the way, the URL associated with that link (i.e. http://docstore.mik.ua/orelly/perl/cookbook/ch08_07.htm) is questionable: it's O'Reilly material provided by someone else. I've seen arguments for and against this specific one, so I'm just pointing it out.]

    I have a 10GB test file (100,000,000 records of 100 bytes each) which I used to test my solution. Not surprisingly, this data bore no resemblance to the OP data, so I added some additional code to create dummy chromosome and position values:

    #my ($chr, $pos) = (split ' ', $locations[$indexes[$rand_index]])[0, 1];
    my ($chr, $pos) = (split '', $locations[$indexes[$rand_index]])[0, 1];
    $chr = 'chromosome' . ($_ % 4 + 1);
    $pos = $_ + 10;

    I also changed the subset size from my original test value of 4 to the OP's example of 500.

    With twice the expected file size and additional processing, this took almost exactly 40 minutes.

    I tried another solution (without Tie::File) using tell to create the index and seek to locate the wanted random records: beyond this, the rest of the code was unchanged. This solution also took almost exactly 40 minutes.
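
    The tell/seek code itself is not shown here, so the following is only a minimal sketch of the general idea, with hypothetical sub names (build_index, fetch_line): record where each line starts, then jump straight to a wanted line.

```perl
use strict;
use warnings;

# Hypothetical helper: return a reference to an array holding the byte
# offset at which each line of the open filehandle $fh starts.
sub build_index {
    my ($fh) = @_;
    my @offsets;
    seek $fh, 0, 0 or die "seek failed: $!";
    while (1) {
        my $pos = tell $fh;                  # offset before reading the line
        defined(my $line = <$fh>) or last;   # stop at end of file
        push @offsets, $pos;
    }
    return \@offsets;
}

# Hypothetical helper: fetch line number $n (0-based) using the index.
sub fetch_line {
    my ($fh, $offsets, $n) = @_;
    seek $fh, $offsets->[$n], 0 or die "seek failed: $!";
    my $line = <$fh>;
    chomp $line;
    return $line;
}
```

    Building the index is one sequential pass over the file; each later fetch costs one seek plus one line read.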

    While other solutions may be faster, I don't consider this to be "horribly slow" or exhibiting "abysmal" performance.

    -- Ken

        Yes, that's the same content that was originally linked to. Thanks.

        -- Ken
