I am likely being thick, but I don't understand. The value of the column (not a column name) is what is being used as the file name. It is not possible to know the values in advance without first going through every line of every file. Even if you did that, you would still need to store the information in a hash so that you could later look up the filehandle corresponding to each value, so I see this as a slower variation on my proposed solution. What am I missing?
Ah, sorry I wasn't clear. I assumed that one knew the (small) set of possible column values to be used as filenames. If you do not know this set of values, my method may still be faster, but prescanning the table will add some time to the execution.
Once you have established a hash mapping column values to filehandles, you can print directly to the desired filehandle. I expect a single hash lookup to be much faster than a pair of system calls for opening and closing files. Besides the OS bookkeeping and disk I/O overhead of each open and close, closing a file after every line forces its buffer to be flushed (and, depending on the OS and filesystem, written to disk) for every line written.
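To make the hash-of-filehandles idea concrete, here is a minimal sketch in Python (a dict plays the role of the hash). The function name `split_by_column`, the tab separator, and the sample rows are all made up for illustration, and it assumes every column value is safe to use as a file name. Handles are opened on first sight here, which avoids a prescan; pre-opening a known set of values would work the same way.

```python
import os
import tempfile


def split_by_column(lines, column_index, out_dir, sep="\t"):
    """Write each line to a file named after the value in one column."""
    handles = {}  # column value -> open file handle
    try:
        for line in lines:
            key = line.rstrip("\n").split(sep)[column_index]
            fh = handles.get(key)
            if fh is None:
                # First time this value appears: open its file once and cache it.
                fh = open(os.path.join(out_dir, key), "w")
                handles[key] = fh
            fh.write(line)  # one dict lookup per line, no open/close
    finally:
        # Close (and thereby flush) every file exactly once, after the loop.
        for fh in handles.values():
            fh.close()
    return sorted(handles)


# Hypothetical sample data: the second column decides the output file.
demo_dir = tempfile.mkdtemp()
demo_rows = ["a\tx\n", "b\ty\n", "c\tx\n"]
demo_keys = split_by_column(demo_rows, 1, demo_dir)
```

One caveat with this design: every distinct value keeps a file open until the end, so a very large number of distinct values could run into the per-process limit on open files.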
Another, completely different method is to append the lines to different strings, one for each column value, and then write all the strings out to files after the loop.
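A rough Python sketch of this buffering approach follows, under the same assumptions as before: the name `split_by_column_buffered`, the tab separator, and the sample rows are hypothetical, and column values are assumed to be valid file names. It collects lines in a list per value rather than growing one string, which is the usual Python idiom for the same idea.

```python
import os
import tempfile
from collections import defaultdict


def split_by_column_buffered(lines, column_index, out_dir, sep="\t"):
    """Collect lines per column value in memory, then write each file once."""
    buffers = defaultdict(list)  # column value -> list of lines seen so far
    for line in lines:
        key = line.rstrip("\n").split(sep)[column_index]
        buffers[key].append(line)

    # After the loop: exactly one open/write/close per distinct value.
    for key, chunk in buffers.items():
        with open(os.path.join(out_dir, key), "w") as fh:
            fh.writelines(chunk)
    return sorted(buffers)


# Hypothetical sample data: the first column decides the output file.
buf_dir = tempfile.mkdtemp()
buf_rows = ["red\t1\n", "blue\t2\n", "red\t3\n"]
buf_keys = split_by_column_buffered(buf_rows, 0, buf_dir)
```

The trade-off is memory: all of the input is held in the buffers until the loop finishes, so this only suits data that fits comfortably in RAM.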
I expect a single hash lookup to be much faster than a pair of system calls for opening and closing files
I don't want to sound like I am beating a dead horse here, but that sounds identical to my solution, except that your way seems like it would be slower: instead of figuring it out as you go, you are processing the files twice.