This topic branches out from one of my other posts, "Efficient way to sum columns in a file". Since this topic is slightly different from the earlier one, I am starting a new thread.
I tested two ways to cut columns from a delimited file: the first was UNIX cut and the second was a simple Perl script. Unfortunately, the Perl script performed poorly against the cut utility. I ran the tests a few times to make sure the results were repeatable.
Here are the timed test results:
The above test was done with 500,000 rows and 25 columns. The cut operation was performed to get the first 15 columns. The link above has code to generate random data (thanks to Random Walk).
As you can see, my Perl script is not as fast as UNIX cut. I have two questions:
1. Can this script be improved so that it is comparable to the UNIX cut command in performance? If the Perl script can finish in 10 seconds, that will be great (a 50% drop in performance)! I am happy to take this performance drop because it keeps the script clean and portable (I typically work on UNIX machines, so portability is not a huge requirement).
2. If that is not possible, would you typically consider piping output from cut when the script does not require all the columns for processing? For example, if the script only needs 3 columns out of a possible 200, would you pipe the 3-column output from cut instead of splitting all 200 columns in Perl and keeping only the 3 that are required?
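To make the two alternatives concrete, here is a minimal sketch. It is not the script timed above. It assumes comma-delimited input, and the names `first_columns` and `open_cut_pipe` are made up for illustration. The first helper uses the LIMIT argument of `split` so that Perl stops splitting once it has the leading columns; the second lets cut do the trimming and reads its output through a pipe.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first $n fields of one delimited line.
# Passing $n + 1 as LIMIT makes split stop early, leaving all remaining
# columns unsplit in one trailing element, which is then discarded.
sub first_columns {
    my ($line, $n, $sep) = @_;
    chomp $line;
    my @fields = split /\Q$sep\E/, $line, $n + 1;
    $#fields = $n - 1 if @fields > $n;    # drop the unsplit remainder
    return join $sep, @fields;
}

# Hypothetical helper: let cut select the first $n columns and return a
# read handle on its output. Assumes a UNIX-like system with cut on PATH.
sub open_cut_pipe {
    my ($file, $n, $sep) = @_;
    open(my $fh, '-|', 'cut', "-d$sep", "-f1-$n", $file)
        or die "Cannot run cut on $file: $!";
    return $fh;
}

# Small in-memory demonstration of the pure-Perl variant.
print first_columns("a,b,c,d,e\n", 3, ","), "\n";    # prints "a,b,c"
```

With the pipe variant, the script would read lines from the returned handle exactly as it would from a file, but each line would already contain only the requested columns.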
I typically work with large files (~a few million rows by 500-800 columns).
Thanks in advance for your thoughts!