Re^6: selecting columns from a tab-separated-values file

by ibm1620 (Hermit)
on Jan 23, 2013 at 20:52 UTC


in reply to Re^5: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file

I happen to already have a program that spawns N children, which then establish socket connections with their parent. With that framework in place, I'd just have the children do the split/join/write, and spray the input across their fd's. (As it turns out, the reading is fast: in the ibuf/obuf experiments, ibuf consumed 6% CPU while obuf consumed 100%.)

Disk I/O doesn't seem to be an issue on this box. The drive spins at 15K RPM, and there's 384GB of RAM! So I would expect an almost linear (if that's the word I want) speed-up by splitting across a few children.
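
For readers without such a framework, here is a minimal sketch of the idea, not the program mentioned above. It assumes a hypothetical helper, spray_columns, that forks N children over socketpair, hands input lines to them round-robin, and has each child write its selected columns to its own file ("$out_prefix.$i" is an invented naming scheme). Line order across the output files is not preserved.

```perl
use strict;
use warnings;
use Socket;
use POSIX ();

# Hypothetical helper (not from the thread): fan the lines of $in_path out
# to $n children; child $i writes columns @cols to "$out_prefix.$i".
sub spray_columns {
    my ( $in_path, $out_prefix, $n, @cols ) = @_;
    my ( @to_child, @pids );

    for my $i ( 0 .. $n - 1 ) {
        socketpair( my $child_end, my $parent_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
            or die "socketpair: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;

        if ( $pid == 0 ) {                   # child
            close $parent_end;
            close $_ for @to_child;          # drop inherited handles to siblings
            open my $out, '>', "$out_prefix.$i" or die "open: $!";
            while ( my $line = <$child_end> ) {
                chomp $line;
                my @f = split /\t/, $line, -1;    # -1 keeps trailing empty fields
                print {$out} join( "\t", @f[@cols] ), "\n";
            }
            close $out or die "close: $!";
            POSIX::_exit(0);                 # skip END blocks and parent's buffers
        }

        close $child_end;                    # parent keeps only its own end
        push @to_child, $parent_end;
        push @pids,     $pid;
    }

    # Spray the input across the children's sockets, one line at a time.
    open my $in, '<', $in_path or die "open $in_path: $!";
    my $next = 0;
    while ( my $line = <$in> ) {
        print { $to_child[$next] } $line;
        $next = ( $next + 1 ) % $n;
    }
    close $in;

    close $_ for @to_child;                  # EOF tells each child to finish
    for my $pid (@pids) {
        waitpid $pid, 0;
        die "child $pid failed" if $?;
    }
    return;
}
```

Something like spray_columns( 'big.tsv', 'out', 4, 2, 0, 5 ) would then leave out.0 through out.3 to be concatenated afterwards. Whether this comes anywhere near a linear speed-up depends on how much the single parent's read-and-print loop costs compared with the split/join in the children.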

Re^7: selecting columns from a tab-separated-values file
by BrowserUk (Patriarch) on Jan 24, 2013 at 00:21 UTC

    With 80GB of data and 384GB of ram, I'd remove IO thrash from the picture entirely by slurping the entire dataset into ram first. Something like this:

    #! perl -slw
    use strict;

    open my $fh, '<:raw', $ARGV[ 0 ] or die $!;
    sysread $fh, my $slurp, -s( $ARGV[0] );
    close $fh;

    local $, = "\t";
    open RAM, '<', \$slurp;
    while( <RAM> ) {
        my @f = split "\t";
        print @f[ 2,0,5 ];
    }
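
    One caveat for a file this size: sysread may return fewer bytes than requested, so a single call is not guaranteed to pull in the whole 80GB. A hedged sketch of a looping variant follows; slurp_raw and its 64MB default chunk size are hypothetical, not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: read all of $path into one scalar, looping because
# sysread may return fewer bytes than were asked for.
sub slurp_raw {
    my ( $path, $chunk ) = @_;
    $chunk ||= 64 * 1024 * 1024;    # assumed chunk size; tune to taste

    open my $fh, '<:raw', $path or die "open $path: $!";
    my $buf = '';
    while (1) {
        # Append at the current end of $buf (4th argument is the offset).
        my $got = sysread $fh, $buf, $chunk, length $buf;
        die "sysread $path: $!" unless defined $got;
        last if $got == 0;          # 0 bytes means EOF
    }
    close $fh;
    return \$buf;                   # a reference avoids copying the data
}
```

    The returned reference can be opened directly as an in-memory file, e.g. open my $ram, '<', slurp_raw( $ARGV[0] ), and then read line by line as before.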

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Interesting procedure! However, this processed the 10M-record file in 83 seconds, as opposed to 60 (to my great surprise).

      UPDATE!!! Correction! I accidentally used perl 5.10 for the above test. I have been using 5.16 for everything else. Rerunning with 5.16 yielded a runtime of 60 seconds.

        Rerunning with 5.16 yielded a runtime of 60 seconds.

        Conclusion: With 384GB of ram, your (relatively) tiny 10e6-line test file is being read from the system file cache, which effectively disguises the disk IO costs.

        If your 80GB file fits in cache and will always be there when you need to do this, you can ignore the effects of disk.

        Otherwise ... you need to re-run all your testing using the real file, flushing the cache before each test.
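
        A minimal sketch of such a cold-cache timing run, assuming the box runs Linux (the thread does not say) and that the script runs as root. drop_page_cache and time_cold are hypothetical helper names.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Assumption: Linux, run as root. Writing 3 to drop_caches asks the kernel
# to free the page cache plus dentries and inodes; sync first so dirty
# pages are written out and can actually be dropped.
sub drop_page_cache {
    system('sync') == 0 or die "sync failed: $?";
    open my $fh, '>', '/proc/sys/vm/drop_caches' or die "drop_caches: $!";
    print {$fh} "3\n";
    close $fh or die "drop_caches: $!";
    return;
}

# Hypothetical helper: call $flush, then run @cmd and return the elapsed
# wall-clock seconds.
sub time_cold {
    my ( $flush, @cmd ) = @_;
    $flush->();
    my $t0 = time;
    system(@cmd) == 0 or die "@cmd failed: $?";
    return time - $t0;
}
```

        For example, time_cold( \&drop_page_cache, 'perl', 'select.pl', 'real.tsv' ), where the script and file names are placeholders. Run it several times and compare against the warm-cache numbers.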


