One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.
The pipe consists of 3 x 4k buffers: one at each end and one 'in the pipe', which is basically kernel-mode RAM. This is easily demonstrated:
That shows the first process attempting to write 128-byte lines out to a pipe while the second process fails to read from that pipe. It succeeds in writing 96 * 128 bytes (12k) before it blocks, pending the second process servicing the pipe. That means that when the first process needs to do another read from disk, the second process has about 100 lines (depending on the length of your lines) to process before it will block waiting for input.
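If you want to repeat that measurement on your own system, here is a minimal Python sketch of the same experiment. It is not the snippet referred to above; the function name `lines_until_pipe_blocks` is made up for illustration, and it assumes a POSIX-style system where `os.set_blocking` works on pipe descriptors. The number you get depends on your OS (the 12k figure is from my system; other kernels use different pipe buffer sizes), so treat the result as something to measure, not something to expect.

```python
import os

def lines_until_pipe_blocks(line_len=128):
    """Count how many fixed-length lines fit in a pipe that nobody reads.

    Hypothetical helper for illustration. The write end is made
    non-blocking so that a full pipe raises BlockingIOError instead of
    hanging the process.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    line = b"x" * (line_len - 1) + b"\n"
    written = 0
    try:
        while True:
            try:
                # Writes no larger than PIPE_BUF are all-or-nothing, so a
                # 128-byte line is either fully queued or rejected.
                os.write(write_fd, line)
            except BlockingIOError:
                break  # the pipe is full: a blocking writer would stall here
            written += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return written
```

Calling `n = lines_until_pipe_blocks()` and printing `n` and `n * 128` gives the number of lines and bytes the writer can get ahead of a stalled reader.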
In practice, on my system, the second process runs at a very constant 25% CPU (i.e. 100% of 1 of 4 cores) and an extremely constant IO rate, which means that the buffering in the first process and in the pipe between the two processes is absorbing all of the IO waiting due to disk seeks. I don't think that can be improved much.
Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.
How are those forked processes going to get their input?
Either way, if disk head thrash isn't a limiting factor -- which, if you had RAIDed SSD drives, it might not be -- then by far your simplest solution would be to split your huge input file horizontally into 4 or 8 parts (depending on how many CPUs you have) and just run:
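For illustration only, here is a Python sketch of the horizontal-split idea. It is a variation, not the command referred to above: instead of physically splitting the file, each worker is handed a byte range that starts and ends on a line boundary, which also answers the question of how the children get their input. The names `chunk_ranges`, `extract_chunk` and `run_parallel`, and the `.partN` output naming, are all hypothetical, and the code assumes every line has at least as many tab-separated fields as the highest column index requested.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def chunk_ranges(path, parts):
    """Split a file into up to `parts` (start, end) byte ranges on line boundaries."""
    size = os.path.getsize(path)
    bounds = [0]
    with open(path, "rb") as f:
        for i in range(1, parts):
            f.seek(size * i // parts)
            f.readline()              # skip to the start of the next line
            pos = f.tell()
            if bounds[-1] < pos < size:
                bounds.append(pos)
    bounds.append(size)
    return list(zip(bounds[:-1], bounds[1:]))

def extract_chunk(path, start, end, out_path, cols):
    """Write the selected columns of every line in [start, end) to out_path."""
    with open(path, "rb") as src, open(out_path, "wb") as dst:
        src.seek(start)
        pos = start
        while pos < end:
            line = src.readline()
            if not line:
                break
            pos += len(line)
            fields = line.rstrip(b"\r\n").split(b"\t")
            dst.write(b"\t".join(fields[c] for c in cols) + b"\n")

def run_parallel(path, parts, cols):
    """Run one worker process per byte range; each writes its own part file."""
    ranges = chunk_ranges(path, parts)
    outs = [f"{path}.part{i}" for i in range(len(ranges))]
    with ProcessPoolExecutor(max_workers=len(ranges)) as pool:
        jobs = [pool.submit(extract_chunk, path, s, e, out, cols)
                for (s, e), out in zip(ranges, outs)]
        for job in jobs:
            job.result()              # re-raise any worker exception
    return outs
```

Concatenating the part files in order reproduces what a single pass would have written. Whether the parallel version is actually faster depends on the disk-thrash caveat above.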
Finally, if you are going to be doing this kind of thing often -- building files from small, reordered subsets of the total fields -- then you'd probably be better off splitting your input file vertically into a names file, an address file, etc. Then you only need to read 3/50ths of the total data and write the same amount. With a little programming, you can then read a bunch of names from the first file into memory; append the same number of fields from the second and third files in memory; and then write that batch out before looping back to do the next batch.
That way, you are effectively reading sequentially from one file at a time; and writing sequentially to one file at a time; which ought to give you the best possible throughput.
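The batching loop described above might look something like this in Python. This is a sketch under assumptions: each column file holds exactly one field per line, the files are in the same record order, and the name `merge_columns` and the default batch size are invented for the example.

```python
from contextlib import ExitStack
from itertools import islice

def merge_columns(column_paths, out_path, batch_size=10_000):
    """Rebuild tab-separated records from one-field-per-line column files.

    Hypothetical helper: reads `batch_size` lines from each column file in
    turn, so at any moment only one file is being read sequentially.
    """
    if not column_paths:
        raise ValueError("need at least one column file")
    with ExitStack() as stack:
        sources = [stack.enter_context(open(p, encoding="utf-8"))
                   for p in column_paths]
        out = stack.enter_context(open(out_path, "w", encoding="utf-8"))
        while True:
            # One batch per column file, read one file after another.
            batches = [[line.rstrip("\n") for line in islice(src, batch_size)]
                       for src in sources]
            if len({len(b) for b in batches}) != 1:
                raise ValueError("column files are out of step")
            if not batches[0]:
                break                 # every file is exhausted
            out.writelines("\t".join(row) + "\n" for row in zip(*batches))
```

The order of `column_paths` determines the order of fields in the output, so reordering the columns is just a matter of reordering the list. The length check is a cheap guard against the files drifting out of synchronisation.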
The downside is that you have to maintain 50 files in parallel synchronisation. Doable, but risky.
In reply to Re^5: selecting columns from a tab-separated-values file