http://www.perlmonks.org?node_id=1014825


in reply to Re^4: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file

One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.

The pipe consists of 3 x 4KB buffers: one at each end and one 'in the pipe', which is basically kernel-mode RAM. This is easily demonstrated:

C:\test>perl -E"printf( STDERR qq[\r$_] ), say 'X'x127 for 1 .. 1e6" | perl -nE"sleep 100"
96

That shows the first process attempting to write 128-byte lines to a pipe while the second process never reads from it (it just sleeps). The first process succeeds in writing 96 * 128 bytes (12KB) before it blocks waiting for the second process to service the pipe. That means that when the first process needs to do another read from disk, the second process has about 100 lines (depending on the length of your lines) to work through before it blocks waiting for input.

In practice on my system, the second process runs with a very constant 25% CPU (i.e. 100% of 1 of 4 cores) and an extremely constant IO rate, which means that the buffering in the first process, and in the pipe between the two processes, is absorbing all of the IO waits caused by disk seeks. I don't think that can be improved much.

Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.

How are those forked processes going to get their input?

Either way, if disk head thrash isn't a limiting factor -- which, if you had RAIDed SSDs, it might not be -- then by far your simplest solution would be to split your huge input file horizontally into 4 or 8 parts (depending on how many CPUs you have) and just run:

for /l %i in (1,1,8) do @perl -F"\t" -anle"print join chr(9), @F[2,0,5]" in%i.tsv > out%i.tsv
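
For completeness, here is one way the in1.tsv .. in8.tsv parts might be produced in the first place. A minimal sketch only: the script name, the counting pass and the equal-lines chunking are assumptions, not anything tested in this thread:

#! perl -w
# split_parts.pl (name assumed) -- split a big TSV into N chunks of roughly
# equal line counts, producing the in1.tsv .. inN.tsv files used above.
# Usage: perl split_parts.pl huge.tsv 8
use strict;

my( $file, $parts ) = @ARGV;
$parts ||= 8;

open my $in, '<', $file or die "$file: $!";

# One counting pass (costs an extra read of the file) so that every chunk
# gets an equal share of the lines.
my $lines = 0;
++$lines while <$in>;
seek $in, 0, 0 or die $!;

my $per = int( $lines / $parts ) + 1;

for my $n ( 1 .. $parts ) {
    open my $out, '>', "in$n.tsv" or die "in$n.tsv: $!";
    my $written = 0;
    while( $written < $per ) {
        my $line = <$in>;
        last unless defined $line;
        print {$out} $line;
        ++$written;
    }
    close $out;
}
close $in;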

Finally, if you are going to be doing this kind of thing -- building files from small, reordered subsets of the total fields -- then you'd probably be better off splitting your input file vertically into a names file, an address file, etc. Then you only need to read 3/50ths of the total data and write the same amount. With a little programming, you can then read a bunch of names from the first file into memory; append the same number of fields from the second and third files in memory; and then write that batch out before looping back to do the next batch.

That way, you are effectively reading sequentially from one file at a time and writing sequentially to one file at a time, which ought to give you the best possible throughput.

The downside is that you have to maintain 50 files in parallel synchronisation. Doable, but risky.
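
A rough sketch of that batched recombination step; the file names, field layout and 10,000-line batch size are illustrative assumptions only:

#! perl -w
# recombine.pl (name assumed) -- stitch records back together from three
# vertically-split files in fixed-size batches, so that each input file is
# read sequentially and the output is written sequentially.
# Assumes the three files are (and stay) line-for-line in sync -- the risk
# noted above.
use strict;

my $BATCH = 10_000;    # lines held in memory per file per pass (assumption)

open my $names, '<', 'names.tsv'     or die $!;
open my $addrs, '<', 'addresses.tsv' or die $!;
open my $phone, '<', 'phones.tsv'    or die $!;

sub read_batch {
    my( $fh, $want ) = @_;
    my @batch;
    while( @batch < $want ) {
        my $line = <$fh>;
        last unless defined $line;
        chomp $line;
        push @batch, $line;
    }
    return @batch;
}

while( 1 ) {
    my @n = read_batch( $names, $BATCH );
    my @a = read_batch( $addrs, $BATCH );
    my @p = read_batch( $phone, $BATCH );

    last unless @n;

    # Append the fields from the second and third files to each name record,
    # then write the whole batch out in one sequential burst.
    print join( "\t", $n[ $_ ], $a[ $_ ], $p[ $_ ] ), "\n" for 0 .. $#n;
}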



Re^6: selecting columns from a tab-separated-values file
by ibm1620 (Hermit) on Jan 23, 2013 at 20:52 UTC

    I happen to already have a program that spawns N children, which then establish socket connections with their parent. With that framework in place, I'd just have the children do the split/join/write, and spray the input across their fd's. (As it turns out, the reading is fast - in the ibuf/obuf experiments, ibuf consumed 6% CPU while obuf consumed 100%.)

    Disk I/O doesn't seem to be an issue on this box. The drive spins at 15K RPM, and there's 384GB of RAM! So I would expect an almost linear (if that's the word I want) speed-up by splitting across a few children.
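
    A minimal sketch of that shape -- socketpair-connected children with the parent spraying lines round-robin -- purely illustrative; the child count, file names and column choice are guesses, not the actual framework:

    #! perl -w
    # Sketch only: parent forks N children over socketpairs and sprays TSV
    # lines to them round-robin; each child does the split/join and writes
    # its own output file.
    use strict;
    use Socket;
    use IO::Handle;

    my $KIDS = 4;                                # assumption: tune to core count
    my @to_child;

    for my $n ( 1 .. $KIDS ) {
        socketpair( my $child_end, my $parent_end, AF_UNIX, SOCK_STREAM, PF_UNSPEC )
            or die "socketpair: $!";
        my $pid = fork;
        die "fork: $!" unless defined $pid;
        if( $pid == 0 ) {                        # child: cut columns, write own file
            close $parent_end;
            open my $out, '>', "out$n.tsv" or die $!;
            while( my $line = <$child_end> ) {
                chomp $line;
                my @f = split /\t/, $line;
                print {$out} join( "\t", @f[ 2, 0, 5 ] ), "\n";
            }
            exit 0;
        }
        close $child_end;                        # parent keeps the write end
        $parent_end->autoflush( 1 );
        push @to_child, $parent_end;
    }

    my $i = 0;                                   # round-robin the input lines
    while( my $line = <STDIN> ) {
        print { $to_child[ $i++ % $KIDS ] } $line;
    }
    close $_ for @to_child;                      # EOF tells the children to finish
    wait for 1 .. $KIDS;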

      With 80GB of data and 384GB of RAM, I'd remove IO thrash from the picture entirely by slurping the entire dataset into RAM first. Something like this:

      #! perl -slw
      use strict;

      # Slurp the whole file into memory with one raw read.
      open my $fh, '<:raw', $ARGV[ 0 ] or die $!;
      sysread $fh, my $slurp, -s( $ARGV[0] );
      close $fh;

      local $, = "\t";    # output field separator

      # Read the in-memory copy back line by line via an in-memory filehandle.
      open RAM, '<', \$slurp;

      while( <RAM> ) {
          my @f = split "\t";
          print @f[ 2, 0, 5 ];
      }
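
      Invoked as something like perl slurp.pl huge.tsv > out.tsv (the script name is a placeholder): the single sysread pulls the whole file into $slurp, and the in-memory filehandle then feeds the same split-and-print loop as the earlier one-liners.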


        Interesting procedure! However, this processed the 10M-record file in 83 seconds, as opposed to 60 (to my great surprise).

        UPDATE!!! Correction! I accidentally used perl 5.10 for the above test. I have been using 5.16 for everything else. Rerunning with 5.16 yielded a runtime of 60 seconds.