One concern I have is that, at some point fairly early on, it seems like the "pipeline" would get full, at which point the two processes would have to operate in lock-step.

The pipe consists of 3 x 4k buffers: one at each end and one 'in the pipe', which is basically kernel-mode RAM. This is easily demonstrated:

C:\test>perl -E"printf( STDERR qq[\r$_] ), say 'X'x127 for 1 ... 1e6" | perl -nE"sleep 100"
96

That shows the first process attempting to write 128-byte lines to a pipe while the second process fails to read from it. It succeeds in writing 96 * 128 bytes (12k) before it blocks pending the second process servicing the pipe. That means that when the first process needs to do another read from disk, the second process has about 100 lines (depending on the length of your lines) to process before it will block waiting for input.
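If you'd rather measure this from inside a script than eyeball the counter, here is a minimal sketch. It assumes a POSIX-like system where a pipe's write end can be made non-blocking (that may well not work for native Windows pipes), and the sub name pipe_capacity_in_lines is mine, not anything standard. The figure you get depends on your OS and its pipe buffer size, so don't expect it to match the 96 above.

```perl
use strict;
use warnings;
use IO::Handle;

# Count how many fixed-length lines fit into an unread pipe
# before a further write would block.
sub pipe_capacity_in_lines {
    my ($line_len) = @_;
    pipe(my $reader, my $writer) or die "pipe failed: $!";

    # Assumption: non-blocking mode is supported on pipes here (POSIX-like OS).
    $writer->blocking(0);

    my $line  = ('X' x ($line_len - 1)) . "\n";
    my $lines = 0;
    while (1) {
        my $n = syswrite($writer, $line);
        unless (defined $n) {
            last if $!{EAGAIN} || $!{EWOULDBLOCK};   # pipe is full
            die "syswrite failed: $!";
        }
        last if $n < length $line;                   # only part of the line fitted
        $lines++;
    }
    close $writer;
    close $reader;
    return $lines;
}

my $lines = pipe_capacity_in_lines(128);
printf "%d lines of 128 bytes (%d bytes) fitted before the writer would block\n",
    $lines, $lines * 128;
```

Nobody ever reads from the pipe here, so the count is the buffering available while the consumer is stalled.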

In practice on my system, the second process runs at a very constant 25% CPU (i.e. 100% of 1 of 4 cores) and an extremely constant IO rate, which means that the buffering in the first process and in the pipe between the two processes is absorbing all of the IO waiting due to disk seeks. I don't think that can be improved much.

Another thought I've had is that, since the process appears to me to be CPU-bound, it might be worth forking several children and distributing the work across them. Each child would have to write to a separate output file, which admittedly would increase the possibility of disk head thrashing, but I think it's worth a try.

How are those forked processes going to get their input?

  • The parent process reads the input file and writes to multiple pipes?

    If you are frightened that the two-process method will lock-step -- meaning that the reader is unable to feed the writer fast enough -- how will making the reader service multiple writers help?

    It might be able to service two kids. Maybe.

    But then you still have the problem of the additional disk head thrash from outputting to two append points rather than one, concurrently with the reading.

    Worth a kick, but I suspect you'll lose throughput.

  • The kids read from different parts of the file?

    Ignoring the problem of determining start/end points for each kid that coincide with logical record breaks (which is messy but solvable); you just doubled the problem of disk head thrash, by having multiple read points as well as multiple write points.
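For what it's worth, the "messy but solvable" part of the second option can be sketched roughly like this: seek to an approximate offset, throw away the remainder of whatever record you landed in, and use the resulting file position as the start of the next chunk. The sub name chunk_boundaries is hypothetical, and the sketch assumes newline-terminated records with no embedded newlines.

```perl
use strict;
use warnings;

# Return a list of [start, end) byte offsets that split $path into
# up to $parts chunks, each chunk starting at the beginning of a record.
sub chunk_boundaries {
    my ($path, $parts) = @_;
    open my $fh, '<:raw', $path or die "open $path: $!";
    my $size   = (stat $fh)[7];
    my @starts = (0);
    for my $i (1 .. $parts - 1) {
        my $guess = int($size * $i / $parts);
        next if $guess <= $starts[-1];
        seek $fh, $guess, 0 or die "seek $path: $!";
        my $partial = <$fh>;          # discard the rest of the record we landed in
        my $pos = tell $fh;           # now sitting at the start of a record
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;
    return map { [ $starts[$_], $_ < $#starts ? $starts[$_ + 1] : $size ] }
           0 .. $#starts;
}

# Usage (hypothetical): perl chunks.pl big.tsv 4
if (@ARGV == 2) {
    printf "%d-%d\n", @$_ for chunk_boundaries(@ARGV);
}
```

Each kid would then seek to its start offset and stop reading once tell() reaches its end offset. None of that helps with the head thrash, of course.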

Either way, if disk head thrash isn't a limiting factor -- which, if you had RAIDed SSDs, it might not be -- then by far your simplest solution would be to just split your huge input file horizontally into 4 or 8 parts (depending on how many CPUs you have) and just run:

for /l %i in (1,1,8) do @perl -F"\t" -anle"print join chr(9), @F[2,0,5]" in%i.tsv > out%i.tsv
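If you want the parts processed concurrently from a single script, something along these lines might do. It is only a sketch: the sub names extract_part and run_parts are made up for illustration, the file names follow the in%i.tsv / out%i.tsv pattern from the command above, and it relies on fork, which on Windows perls is an emulation using threads.

```perl
use strict;
use warnings;

# Pull the requested columns out of one tab-separated part file.
sub extract_part {
    my ($in_path, $out_path, @cols) = @_;
    open my $in,  '<', $in_path  or die "open $in_path: $!";
    open my $out, '>', $out_path or die "open $out_path: $!";
    while (my $line = <$in>) {
        chomp $line;
        my @f = split /\t/, $line, -1;   # -1 keeps trailing empty fields
        print {$out} join("\t", map { $_ // '' } @f[@cols]), "\n";
    }
    close $out or die "close $out_path: $!";
    close $in;
}

# Fork one child per [input, output] pair and wait for them all.
# Returns the number of children that exited with a non-zero status.
sub run_parts {
    my ($jobs, @cols) = @_;
    my %kids;
    for my $job (@$jobs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                 # child: do one part, then leave
            extract_part(@$job, @cols);
            exit 0;
        }
        $kids{$pid} = $job->[0];
    }
    my $failed = 0;
    while (%kids) {
        my $pid = wait();
        last if $pid == -1;
        $failed++ if $?;
        delete $kids{$pid};
    }
    return $failed;
}

# Usage (hypothetical): perl parts.pl 8   processes in1.tsv .. in8.tsv
if (@ARGV) {
    my @jobs = map { [ "in$_.tsv", "out$_.tsv" ] } 1 .. $ARGV[0];
    my $failed = run_parts(\@jobs, 2, 0, 5);
    die "$failed part(s) failed\n" if $failed;
}
```

One child per part keeps each process reading and writing sequentially within its own pair of files; whether the disk copes with 8 such streams at once is exactly the thrash question above.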

Finally, if you are going to be doing this kind of thing -- building files from small, reordered subsets of the total fields -- then you'd probably be better off splitting your input file vertically into a names file, an address file, etc. Then you only need to read 3/50ths of the total data and write the same amount. With a little programming, you can then read a bunch of names from the first file into memory; append the same number of fields from the second and third files in memory; and then write that batch out before looping back to do the next batch.

That way, you are effectively reading sequentially from one file at a time; and writing sequentially to one file at a time; which ought to give you the best possible throughput.

The downside is that you have to maintain 50 files in parallel synchronisation. Doable, but risky.
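The batch-at-a-time merge might look something like the sketch below. It assumes a layout the thread hasn't settled on: one file per field, one value per line, with line N of every file belonging to record N. The sub name merge_columns and the file names in the usage line are hypothetical.

```perl
use strict;
use warnings;

# Merge single-field files into one tab-separated file, $batch records at a time.
# Only one input file is being read at any moment within a batch.
sub merge_columns {
    my ($out_path, $batch, @col_paths) = @_;
    my @in = map {
        open my $fh, '<', $_ or die "open $_: $!";
        $fh;
    } @col_paths;
    open my $out, '>', $out_path or die "open $out_path: $!";

    my $first = $in[0];
    while (1) {
        # Read up to $batch values from the first field file.
        my @rows;
        while (@rows < $batch and defined(my $v = <$first>)) {
            chomp $v;
            push @rows, $v;
        }
        last unless @rows;

        # Append the same number of values from each remaining file in turn.
        for my $fh (@in[1 .. $#in]) {
            for my $row (@rows) {
                my $v = <$fh>;
                defined $v or die "field files are out of sync\n";
                chomp $v;
                $row .= "\t$v";
            }
        }
        print {$out} "$_\n" for @rows;
    }

    # Any leftover lines mean the files have drifted apart.
    for my $fh (@in[1 .. $#in]) {
        die "field files are out of sync\n" if defined(my $extra = <$fh>);
    }
    close $out or die "close $out_path: $!";
    return;
}

# Usage (hypothetical): perl merge.pl out.tsv phone.txt name.txt city.txt
merge_columns($ARGV[0], 10_000, @ARGV[1 .. $#ARGV]) if @ARGV >= 2;
```

The out-of-sync checks are there because of the risk just mentioned: one missing line in one field file silently shifts every record after it.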

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re^5: selecting columns from a tab-separated-values file by BrowserUk
in thread selecting columns from a tab-separated-values file by ibm1620
