I happen to already have a program that spawns N children, which then establish socket connections with their parent. With that framework in place, I'd just have the children do the split/join/write, and have the parent spray the input across their fds. (As it turns out, the reading is fast: in the ibuf/obuf experiments, ibuf consumed 6% CPU while obuf consumed 100%.)
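To make the idea concrete, here is a minimal sketch in Python of the parent/children arrangement. It is not my actual program. The names `child_loop` and `spray`, the tab and comma delimiters, the `part-N.out` file names, and the round-robin dealing are all assumptions made for illustration. It also assumes a Unix box (it uses `os.fork`) and that every input line ends in a newline.

```python
import os
import socket


def child_loop(sock, out_path):
    """Child: read lines from the parent, do the split/join, write the result."""
    with sock.makefile("r") as reader, open(out_path, "w") as out:
        for line in reader:
            fields = line.rstrip("\n").split("\t")  # assumed input delimiter
            out.write(",".join(fields) + "\n")      # assumed output delimiter


def spray(lines, n_children, out_dir):
    """Parent: fork n_children, deal lines round-robin across their sockets."""
    writers, pids, paths = [], [], []
    for i in range(n_children):
        parent_sock, child_sock = socket.socketpair()
        path = os.path.join(out_dir, f"part-{i}.out")  # hypothetical naming
        pid = os.fork()
        if pid == 0:
            status = 1
            try:
                # Drop inherited parent-side ends so every child sees EOF promptly.
                parent_sock.close()
                for w in writers:
                    w.close()
                child_loop(child_sock, path)
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code path
        child_sock.close()
        writers.append(parent_sock.makefile("w"))
        parent_sock.close()  # the makefile object keeps the descriptor open
        pids.append(pid)
        paths.append(path)

    for n, line in enumerate(lines):
        writers[n % n_children].write(line)
    for w in writers:
        w.close()  # flush buffered lines and signal EOF to the child
    for pid in pids:
        os.waitpid(pid, 0)
    return paths
```

Calling `spray(open("input.txt"), 4, "/tmp/out")` (both paths hypothetical) would leave four partial output files. Note that round-robin dealing does not preserve the original line order across files, so this only fits if order doesn't matter or the parts are merged afterwards.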
Disk I/O doesn't seem to be an issue on this box. The drive spins at 15K RPM, and there's 384GB of RAM! So I would expect an almost linear (if that's the word I want) speed-up by splitting across a few children.
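As a back-of-envelope check on "almost linear", here is a rough estimate. It assumes the 6% and 100% CPU figures were measured at the same throughput, that the children share nothing, and that the parent's extra cost of writing to the sockets is negligible (it probably isn't entirely). The function name and the model itself are my own illustration, not a measurement.

```python
def estimated_speedup(n_children, reader_cpu=0.06, worker_cpu=1.00):
    """Pipeline estimate: linear in n_children until the single reader saturates."""
    # One reader at 6% CPU can feed roughly worker_cpu / reader_cpu workers
    # before it becomes the bottleneck (about 16.7 with these figures).
    reader_limit = worker_cpu / reader_cpu
    return min(n_children, reader_limit)
```

Under those assumptions the speed-up stays linear for a few children and only flattens out somewhere around 16, where the single reader would hit 100% CPU.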