http://www.perlmonks.org?node_id=1014812


in reply to Re^2: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file

The construction of the arrays from the split lines consumes a fair amount of CPU. As does the allocation of the buffers for reading into and writing from. As does the searching of the 4K buffers to locate the line ends. And the search of the lines to find the tabs.
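
For context, a minimal sketch of the kind of per-line split/join loop whose costs are being described here; the field indices [2,0,5] are only illustrative (borrowed from the examples further down the thread), not the OP's actual columns:

    #!/usr/bin/perl
    use strict; use warnings;

    # Read tab-separated lines on STDIN; print a reordered subset of the columns.
    while ( my $line = <STDIN> ) {
        chomp $line;
        my @fields = split /\t/, $line;                  # the split costed above
        print join( "\t", @fields[ 2, 0, 5 ] ), "\n";    # illustrative indices
    }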

But still, the minimum elapsed time is dominated by the time it takes to read 80 GB from disk whilst simultaneously writing ~5 GB back to it. On my system -- 2.6 GHz 4-core Q6600 + 7200 RPM SATA2 disk with a 16 MB cache -- that reading and writing -- with no other processing, searching, memory allocation etc. -- requires a minimum of 56 minutes (roughly 25 MB/s of combined read+write throughput).

The difference between that and your 5 hours comes in two parts.

  1. The internal processing of split & join.
  2. The waiting for IO.

If you can overlap the IO and the internal processing, you can claw back some of that difference.

Have you tried my two processes solution?
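
That solution isn't reproduced in this node, but its shape is roughly: one process that does nothing but read and forward lines, piped into a second that does all of the split/join work. Something along these lines (file names and field indices illustrative; not the original ibuf/obuf code):

    perl -pe1 in.tsv | perl -F"\t" -anle"print join chr(9), @F[2,0,5]" > out.tsv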


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^4: selecting columns from a tab-separated-values file
by ibm1620 (Hermit) on Jan 23, 2013 at 20:25 UTC

    I'm back at work and have tested the two-process solution. It took 60 seconds to pass 10M (M=million) records. Then I pulled the logic for splitting and joining the records out of obuf and into ibuf (thus eliminating obuf) and ran the same test, and it ran in 62 seconds. (In both cases the output was to /dev/null.)

    I reran the tests sending output to an actual file in the same directory as the input, and obtained exactly the same runtimes.

    In ALL cases, the CPU of the process doing the split/join was pegged at 100%.

    So I have to conclude that disk I/O is negligible for this program, in my environment.
      It took 60 seconds to pass 10M (M=million) records.

      Hm. 10e6 in a minute suggests ~100 minutes for 1e9 -- a total time well under 2 hours.

      This post mentions a time of 5 hours. What changed?


        The number of fields being extracted, mainly. My example of three fields was just a simplified illustration of my question.
Re^4: selecting columns from a tab-separated-values file
by ibm1620 (Hermit) on Jan 23, 2013 at 02:25 UTC

    I've not yet tried the two-process solution, but I intend to. One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.

    Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.

      One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.

      The pipe consists of 3 x 4k buffers: one at each end and one 'in the pipe', which is basically kernel-mode RAM. This is easily demonstrated:

      C:\test>perl -E"printf( STDERR qq[\r$_] ), say 'X'x127 for 1 ... 1e6" | perl -nE"sleep 100"
      96

      That shows the first process attempting to write 128-byte lines to a pipe that the second process is not reading from. It succeeds in writing 96 * 128 bytes (12k) before it blocks, pending the second process servicing the pipe. That means that when the first process needs to do another read from disk, the second process has about 100 lines (depending on the length of your lines) to process before it will block waiting for input.

      In practice on my system, the second process runs with a very constant 25% CPU (i.e. 100% of 1 of 4 cores) and an extremely constant IO rate, which means that the buffering in the first process and in the pipe between the two processes is absorbing all of the IO waiting due to disk seeks. I don't think that can be improved much.

      Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.

      How are those forked processes going to get their input?

      • The parent process reads the input file and writes to multiple pipes? (A rough sketch of this shape follows the list.)

        If you are frightened that the two-process method will lock-step -- meaning that the reader is unable to feed the writer fast enough -- how will making the reader service multiple writers help?

        It might be able to service two kids. Maybe.

        But then you still have the problem of the additional disk head thrash from outputting to two append points rather than one, concurrently with the reading.

        Worth a kick, but I suspect you'll lose throughput.

      • The kids read from different parts of the file?

        Ignoring the problem of determining start/end points for each kid that coincide with logical record breaks (which is messy but solvable), you just doubled the problem of disk head thrash by having multiple read points as well as multiple write points.
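
      For what it's worth, a minimal sketch of the first option -- the parent round-robins lines to N children over pipes, each child doing the split/join and writing its own output file. None of this is code from the thread; the child count, file names and field indices are all illustrative:

        # fanout.pl -- run as: perl fanout.pl < huge.tsv   (names are hypothetical)
        use strict; use warnings;

        my $KIDS = 4;
        my @pipes;
        for my $n ( 0 .. $KIDS - 1 ) {
            pipe( my $rd, my $wr ) or die "pipe: $!";
            defined( my $pid = fork() ) or die "fork: $!";
            if ( $pid == 0 ) {                 # child: split/join its share of lines
                close $wr;
                close $_ for @pipes;           # write ends inherited from earlier kids
                open my $out, '>', "out$n.tsv" or die "out$n.tsv: $!";
                while ( my $line = <$rd> ) {
                    chomp $line;
                    my @f = split /\t/, $line;
                    print {$out} join( "\t", @f[ 2, 0, 5 ] ), "\n";   # illustrative
                }
                exit 0;
            }
            close $rd;                         # parent keeps only the write ends
            push @pipes, $wr;
        }

        my $i = 0;                             # parent: round-robin the input lines
        while ( my $line = <STDIN> ) {
            print { $pipes[ $i++ % $KIDS ] } $line;
        }
        close $_ for @pipes;                   # EOF the children, then reap them
        wait() for 1 .. $KIDS;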

      Either way, if disk head thrash isn't a limiting factor -- which, if you had RAIDed SSD drives, it might not be -- then by far your simplest solution would be to split your huge input file horizontally into 4 or 8 parts (depending on how many CPUs you have) and just run:

      for /l %i in (1,1,8) do @perl -F"\t" -anle"print join chr(9), @F[2,0,5]" in%i.tsv > out%i.tsv

      Finally, if you are going to be doing this kind of thing -- building files from small, reordered subsets of the total fields -- then you'd probably be better off splitting your input file vertically into a names file, an address file, etc. Then you need only read 3/50ths of the total data and write the same amount. With a little programming, you can then read a bunch of names from the first file into memory; append the same number of fields from the second and third files in memory; and then write that batch out before looping back to do the next batch.

      That way, you are effectively reading sequentially from one file at a time; and writing sequentially to one file at a time; which ought to give you the best possible throughput.

      The downside is that you have to maintain 50 files in parallel synchronisation. Doable, but risky.
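
      A rough sketch of that batched stitch-together loop, assuming just three pre-split files (names.tsv, addr1.tsv, addr2.tsv -- hypothetical names, not from the thread) whose Nth lines all belong to the same record:

        use strict; use warnings;

        my $BATCH = 100_000;                       # records held in memory per pass
        open my $names, '<', 'names.tsv'  or die $!;
        open my $addr1, '<', 'addr1.tsv'  or die $!;
        open my $addr2, '<', 'addr2.tsv'  or die $!;
        open my $out,   '>', 'subset.tsv' or die $!;

        while ( !eof $names ) {
            my ( @n, @a, @b );
            # read a batch from one file at a time, so each read is sequential
            while ( @n < $BATCH and !eof $names ) { my $l = <$names>; chomp $l; push @n, $l }
            while ( @a < @n     and !eof $addr1 ) { my $l = <$addr1>; chomp $l; push @a, $l }
            while ( @b < @n     and !eof $addr2 ) { my $l = <$addr2>; chomp $l; push @b, $l }
            # then write the whole batch out sequentially
            print {$out} join( "\t", $n[$_], $a[$_], $b[$_] ), "\n" for 0 .. $#n;
        }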



        I happen to already have a program that spawns N children, which then establish socket connections with their parent. With that framework in place, I'd just have the children do the split/join/write, and spray the input across their fd's. (As it turns out, the reading is fast - in the ibuf/obuf experiments, ibuf consumed 6% CPU while obuf consumed 100%.)

        Disk I/O doesn't seem to be an issue on this box. The drive spins at 15K RPM, and there's 384GB of RAM! So I would expect an almost linear (if that's the word I want) speed-up by splitting across a few children.