See, it made you think :). Your threaded solution assigns each file to a thread for processing, so what if one of the files is much larger than the sum of all the others? The whole execution time will then be determined by that one big file.
Now, why did you wander off into bin packing? It is nice that you possess such knowledge, but it seems to be getting in the way a bit. How about treating the whole set of files as a single file and piping it line by line to each of the worker nodes in turn? Would that not divide the load equally?
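To make the idea concrete, here is a rough sketch of what I mean, not a finished implementation. The function name `deal_lines` and the in-memory lists standing in for files are made up for illustration; in a real setup each line would be written to a pipe or queue feeding a worker rather than collected in a list.

```python
from itertools import chain, cycle

def deal_lines(sources, n_workers):
    """Chain all sources into one stream and hand lines to workers in turn.

    `sources` is any iterable of line iterables (open file objects work too).
    Returns one list of lines per worker. Collecting lists is an assumption
    made to keep the sketch small; a real pipeline would push each line to
    the worker's pipe or queue instead.
    """
    buckets = [[] for _ in range(n_workers)]
    # cycle() walks the workers round-robin; zip() stops when lines run out.
    for bucket, line in zip(cycle(buckets), chain.from_iterable(sources)):
        bucket.append(line)
    return buckets

# One large "file" and two tiny ones (hypothetical stand-ins for real files).
big = [f"big-{i}" for i in range(10)]
small_a = ["a-0"]
small_b = ["b-0"]

shares = deal_lines([big, small_a, small_b], n_workers=3)
print([len(s) for s in shares])  # [4, 4, 4]
```

File boundaries no longer matter here: the big file's lines end up spread over all three workers, so no single worker is stuck with it.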
Your suggestion to order the files by size when using threads is certainly an improvement, and it has the advantage of keeping the size and complexity of the code low, but it is still sub-optimal.
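A small simulation shows why. This is only a sketch under two assumptions of mine: processing time is proportional to file size, and whichever thread is free takes the next file from the size-sorted list (which amounts to greedy largest-first assignment). The name `assign_largest_first` and the sizes are invented for the example.

```python
import heapq

def assign_largest_first(sizes, n_workers):
    """Simulate threads taking whole files, largest first.

    Each file goes to the currently least-loaded worker, which models a
    free thread picking up the next file. Returns the total load per worker.
    """
    loads = [0] * n_workers
    heap = [(0, w) for w in range(n_workers)]  # (current load, worker id)
    for size in sorted(sizes, reverse=True):
        load, w = heapq.heappop(heap)
        loads[w] = load + size
        heapq.heappush(heap, (loads[w], w))
    return loads

# Hypothetical file sizes: one file dwarfs the sum of the rest.
sizes = [100, 5, 4, 3, 2, 1]
loads = assign_largest_first(sizes, n_workers=3)
print(loads)       # [100, 8, 7]
print(max(loads))  # 100
```

The total work is 115 units, so an even split would be about 38 per worker, yet the run still takes 100 because a whole file can never be shared between threads.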
The main reason I stick with my approach is that it sort of forces you to think about the design of the solution, and I mean this in more general terms. Still, your solution is as elegant as it can get for this particular problem.