in reply to Re^4: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file
One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.
The pipe consists of 3 x 4k buffers: one at each end and one 'in the pipe', which is basically kernel-mode RAM. This is easily demonstrated:
C:\test>perl -E"printf( STDERR qq[\r$_] ), say 'X'x127 for 1 ... 1e6" | perl -nE"sleep 100"
96
That shows the first process attempting to write 128-byte lines out to a pipe with the second process failing to read from that pipe. It succeeds in writing 96 * 128 bytes (12k) before it blocks pending the second process servicing the pipe. That means that when the first process needs to do another read from disk, the second process has about 100 lines (depending on the length of your lines) to process before it will block waiting for input.
In practice on my system, the second process runs with a very constant 25% CPU (i.e. 100% of 1 of 4 cores) and an extremely constant IO rate, which means that the buffering in the first process and in the pipe between the two processes is absorbing all of the IO waiting due to disk seeks. I don't think that can be improved much.
Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.
How are those forked processes going to get their input?
- The parent process reads the input file and writes to multiple pipes?
If you are frightened that the two process method will lock-step -- meaning that the reader was unable to feed the writer fast enough -- how will making the reader service multiple writers help?
It might be able to service two kids. Maybe.
But then you still have the problem of the additional disk head thrash from outputting to two append points rather than one, concurrently with the reading.
Worth a kick, but I suspect you'll lose throughput.
- The kids read from different parts of the file?
Ignoring the problem of determining start/end points for each kid that coincide with logical record breaks (which is messy but solvable), you have just doubled the problem of disk head thrash by having multiple read points as well as multiple write points.
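For what it's worth, the "messy but solvable" part might look something like the sketch below. It is only an illustration: `chunk_ranges` is a hypothetical helper name, and it assumes newline-terminated records and a file opened in raw mode so that byte offsets are what `seek` and `tell` report. Each guess at a split point is pushed forward to the start of the next record.

```perl
use strict;
use warnings;

# Hypothetical helper: return a list of [start, end) byte ranges that split
# $path into at most $n parts, each part ending just after a newline.
sub chunk_ranges {
    my ( $path, $n ) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $size = -s $fh;

    my @starts = (0);
    for my $i ( 1 .. $n - 1 ) {
        my $guess = int( $size * $i / $n );
        next if $guess <= $starts[-1];

        # Step back one byte so that a guess already sitting on a record
        # boundary is kept rather than skipped past.
        seek $fh, $guess - 1, 0 or die "seek failed: $!";
        my $rest_of_record = <$fh>;    # discard up to and including the newline
        my $pos = tell $fh;
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;

    my @ranges;
    for my $i ( 0 .. $#starts ) {
        my $end = $i < $#starts ? $starts[ $i + 1 ] : $size;
        push @ranges, [ $starts[$i], $end ];
    }
    return @ranges;
}
```

Each kid would then seek to its own start offset and stop reading once `tell` reaches its end offset. None of that does anything about the head thrash, of course.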
Either way, if disk head thrash isn't a limiting factor -- which, if you had RAIDed SSD drives, it might not be -- then by far your simplest solution would be to just split your huge input file horizontally into 4 or 8 parts (depending how many CPUs you have) and just run:
for /l %i in (1,1,8) do @perl -F"\t" -anle"print join chr(9), @F[2,0,5]" in%i.tsv > out%i.tsv
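The horizontal split itself could be done along these lines. This is a rough sketch rather than anything tested on a huge file; `split_rows` is a made-up name, and the `in1.tsv` .. `inN.tsv` naming is simply chosen to match the loop above. It cuts only at line ends, and aims for roughly equal byte counts per part.

```perl
use strict;
use warnings;

# Hypothetical helper: copy $in_path into at most $parts files named
# "${prefix}1.tsv", "${prefix}2.tsv", ... cutting only at line ends.
sub split_rows {
    my ( $in_path, $prefix, $parts ) = @_;
    open my $in, '<', $in_path or die "Cannot open $in_path: $!";
    my $target = ( -s $in ) / $parts;    # approximate bytes per part

    my ( $part, $written, $out ) = ( 0, 0, undef );
    while ( my $line = <$in> ) {
        # Open the first part, or move on once the current part has its share.
        if ( !$out || ( $written >= $target && $part < $parts ) ) {
            close $out if $out;
            $part++;
            my $name = "$prefix$part.tsv";
            open $out, '>', $name or die "Cannot open $name: $!";
            $written = 0;
        }
        print {$out} $line;
        $written += length $line;
    }
    close $out if $out;
    close $in;
    return $part;    # number of part files actually written
}
```

Calling `split_rows( 'huge.tsv', 'in', 8 )` (with `huge.tsv` standing in for your real file name) would leave `in1.tsv` through `in8.tsv` ready for the loop.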
Finally, if you are going to be doing this kind of thing -- building files from small, reordered subsets of the total fields -- then you'd probably be better off splitting your input file vertically into a names file, an address file, etc. Then you only need to read 3/50ths of the total data and write the same amount. With a little programming, you can then read a bunch of names from the first file into memory; append the same number of fields from the second and third files in memory; and then write that batch out before looping back to do the next batch.
That way, you are effectively reading sequentially from one file at a time; and writing sequentially to one file at a time; which ought to give you the best possible throughput.
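A minimal sketch of that batch loop, under some assumptions of mine: each column lives in its own file with one field value per line, line N of every file belongs to record N, and `merge_columns` is just an illustrative name. The column files are visited one at a time per batch, so the reads stay sequential within each file.

```perl
use strict;
use warnings;

# Hypothetical helper: build a TSV at $out_path from the given column files,
# in the order given, processing $batch_size records at a time.
sub merge_columns {
    my ( $out_path, $batch_size, @col_paths ) = @_;

    my @in;
    for my $path (@col_paths) {
        open my $fh, '<', $path or die "Cannot open $path: $!";
        push @in, $fh;
    }
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";

    while (1) {
        # Read one batch from the first column file only.
        my @rows;
        while ( @rows < $batch_size ) {
            my $field = readline( $in[0] );
            last unless defined $field;
            chomp $field;
            push @rows, $field;
        }
        last unless @rows;

        # Then take the same number of fields from each remaining file in
        # turn, appending them to the rows held in memory.
        for my $fh ( @in[ 1 .. $#in ] ) {
            for my $row (@rows) {
                my $field = readline($fh);
                die "Column files are out of step" unless defined $field;
                chomp $field;
                $row .= "\t$field";
            }
        }

        print {$out} "$_\n" for @rows;
    }

    close $_ for @in;
    close $out or die "Cannot close $out_path: $!";
    return;
}
```

The equivalent of `@F[2,0,5]` is then just the order in which you pass the column files, and the batch size is the knob that trades memory for fewer switches between files.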
The downside is that you have to maintain 50 files in parallel synchronisation. Doable, but risky.