Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Re: Randomize lines with limited memory

by Corion (Pope)
on Nov 01, 2003 at 22:23 UTC ( #303846=note: print w/ replies, xml ) Need Help??


in reply to Randomize lines with limited memory

You can simply trade time for space here:

  1. Count the number of lines in your text file
  2. Generate a randomized order of lines by choosing a permutation of the line numbers (this will take about 32 bytes per line number).
  3. Rewind your text file.
  4. Read up to the line that is to become the next line written.
  5. Copy it into the target file.
  6. Repeat the three previous steps until the shuffled file is written out.

Of course, this method won't work well if you always choose the same permutation, but you could also do a Fisher-Yates shuffle of the line numbers and then recreate the file from the line numbers as above.

perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web


Comment on Re: Randomize lines with limited memory
Download Code
Re: Re: Randomize lines with limited memory
by sauoq (Abbot) on Nov 01, 2003 at 23:12 UTC

    I'd probably use a variation on this. Read the file once saving the offset of each line. Shuffle the array of offsets. Step through the array, seeking to each offset in turn, and print the line at that offset.

    Using seek() will avoid all that rewinding and reading.

    By the way, you can do almost exactly this but without worrying about the nitty gritty. Use Tie::File. Just tie the file to @tied_file, shuffle the numbers 0 .. $#tied_file and print out each line in the array in shuffled order.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Re: Randomize lines with limited memory
by TomDLux (Vicar) on Nov 03, 2003 at 02:53 UTC
    Your method involves 3.7 million random file seeks.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

      I did mention that I trade space (memory) for time, didn't I?

      Also, my method dosen't involve random seeks, it only involves rewinding the file to the start point, which is a seek to a fixed position in the file. I think you could halve the number of times the file by checking whether the next line number is greater than the current line number and then winding forward to that one, as I think that there is a 50% chance that the next line number to be written is higher than the current number.

      That point is moot though, as Tie::File implements a caching scheme and is a much cleaner solution than my naive but memory conserving approach.

      perl -MHTTP::Daemon -MHTTP::Response -MLWP::Simple -e ' ; # The $d = new HTTP::Daemon and fork and getprint $d->url and exit;#spider ($c = $d->accept())->get_request(); $c->send_response( new #in the HTTP::Response(200,$_,$_,qq(Just another Perl hacker\n))); ' # web

        Unfortunately, Tie::File is not a solution to this problem.

        For a file of 100,000 lines, it completes, but takes around 800 times longer than List::Util::shuffle on an in memory array. So what, trade time for space. The problem is that the increase in time is exponential, meaning that shuffling the OP's 27,000,000 record file would take close to a week!

        However, more fundemental than that is that Tie::File simply isn't able to cope with a file this large. It runs out of memory and segfaults trying this with a 1,000,000 record file on my machine, where as I can comfortably load a 2,000,000 record file completelly into memory and sort even using the copy semantics of List:Util::shuffle. This is true regardles of what settings (default, bigger or smaller) I tried using for the memory parameter.


        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail
        Hooray!
        Wanted!

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://303846]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (10)
As of 2014-12-20 04:55 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    Is guessing a good strategy for surviving in the IT business?





    Results (95 votes), past polls