|laziness, impatience, and hubris|
Re: split and sysread()by BrowserUk (Pope)
|on Apr 19, 2003 at 09:15 UTC||Need Help??|
When you originally asked this question at speed up one-line "sort|uniq -c" perl code you said that you only wanted the 10th field from an unspecified maximum number. In that case, using a regex to isolate that field alone rather than spliting them all out and then discarding all but one was an obvious way to save some cycles. Using the sliding buffer saved some more for an overall speed-up of about x4 in my tests.
You now appear to be wanting fields (0,3,4,9,17,18,31) which means that the benefits of using a regex over split are considerably lessened--though there is still some saving. Using this in conjunction with the sliding buffer--two variations on the theme, with sysread_1 giving consistantly the best results--and a buffer size of 64k seems to achieve the best results on my machine, with the main benefit seemingly coming from bypassing stdio.
The overall saving on my machine comes out at around 50%, whether this will get you close to your target of 2 minutes you have to see once you actually do something meaningful with the fields inside the loop. If not, I think you may need quicker hardware.
The file used in the tests below is (75MB) 500_000 records x 31 pipe-delimited fields of randomly generated data.
Whilst I've tried various buffer sizes, the test are hardly definitive and you may well get better results with a different size (bigger or smaller)on your machine. Good luck.
Examine what is said, not who speaks.
1) When a distinguished but elderly scientist states that something is possible, he is almost certainly right. When he states that something is impossible, he is very probably wrong.
2) The only way of discovering the limits of the possible is to venture a little way past them into the impossible
3) Any sufficiently advanced technology is indistinguishable from magic.
Arthur C. Clarke.