Re^6: Optimise file line by line parsing, substitute SPLIT

in reply to Re^5: Optimise file line by line parsing, substitute SPLIT
in thread Optimise file line by line parsing, substitute SPLIT

I thought your point was that the OP is actually doing nothing with the data (read = nothing, read + split = nothing too), and that he's soon going to read every word on every page, at which point the split time will be insignificant.

But it seems you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs. split.

Otherwise I agree that split can't really be optimized, just as I wrote above.

Replies are listed 'Best First'.
Re^7: Optimise file line by line parsing, substitute SPLIT
by BrowserUk (Patriarch) on Jun 03, 2013 at 16:03 UTC

But it seems that you mean that OP benchmarks incorrect, because he benchmarks nothing vs split.

No. As a measure of the time taken to do the splits, his benchmark is fine.

What is wrong is his apparent expectation that locating 26 million tab characters, copying 28 million strings and making 28 million assignments would (or should) take less than the 8 seconds it does. 80 million fairly complex operations in 8 seconds is 1 every 10th of a microsecond. And that is pretty damn good.
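For anyone who wants to check that arithmetic, here is a minimal sketch. It only uses the counts and the 8-second figure quoted above; the hash keys are labels I made up for illustration.

```perl
use strict;
use warnings;

# Counts quoted in the post; the key names are illustrative labels only.
my %ops = (
    tabs_located   => 26_000_000,
    strings_copied => 28_000_000,
    assignments    => 28_000_000,
);
my $seconds = 8;    # time the OP measured for the splits

my $total = 0;
$total += $_ for values %ops;

# Average cost per operation, in microseconds.
my $us_per_op = $seconds / $total * 1e6;

printf "%d operations in %d s = %.3f microseconds each\n",
    $total, $seconds, $us_per_op;
```

The exact total is 82 million rather than 80 million, which works out to just under a tenth of a microsecond per operation.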

The only ways to reduce that amount of time are:

1. Overlap the IO and processing.
   8 - 1.3 = 6.7 seconds, assuming perfect overlap, which is pretty much impossible.
   200 * 9.3 = 1860 -v- 200 * 6.7 = 1340
   (1860 - 1340) / 1860 = 28% as a target; but achieving it would be very hard.
2. Run (some of) the 200+ processes in parallel.
   Doing 2 at a time would be a 50% gain; 4 at a time, 75%.
   Much better targets, and actually pretty close to achievable; but it requires careful programming to avoid disk thrash.
3. Do less work.

Adding a single line to my code above:

    next unless /$V/;

Can give a saving of about 90% in some cases:

    C:\test>1036737 -V=500 < numbers.tsv
    Took 19.138550 seconds ## without pre-filter
    Kept 2005 records

    C:\test>1036737 -V=500 < numbers.tsv
    Took 1.755853 seconds ## with pre-filter
    Kept 2005 records

But that saving is negated, and the run is actually slower, for less specific searches:

    C:\test>1036737 -V=5 < numbers.tsv
    Took 18.765492 seconds ## Without pre-filter
    Kept 1944 records

    C:\test>1036737 -V=5 < numbers.tsv
    Took 20.232294 seconds ## With pre-filter
    Kept 1944 records
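To illustrate why the pre-filter helps with a selective value and hurts with an unselective one, here is a rough sketch. It is not the script timed above: `keep_matching`, its selection rule (keep a record when any field equals the value exactly) and the in-memory sample data are all assumptions made so the example runs on its own. It counts how many lines still reach `split` in each case.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the script discussed above: keep a record when
# any tab-separated field is exactly equal to $v. The real selection rule
# is not shown in this thread.
sub keep_matching {
    my ($fh, $v, $prefilter) = @_;
    my @kept;
    my $splits = 0;
    while (my $line = <$fh>) {
        chomp $line;
        # Cheap test first: a line that does not contain $v anywhere
        # cannot have a field equal to it, so skip the split entirely.
        next if $prefilter && $line !~ /\Q$v\E/;
        ++$splits;
        my @fields = split /\t/, $line;
        push @kept, \@fields if grep { $_ eq $v } @fields;
    }
    return (\@kept, $splits);
}

# Small in-memory sample so the sketch runs without numbers.tsv.
my $data = join '', map { join("\t", $_, $_ * 2, $_ * 3) . "\n" } 1 .. 1000;

for my $v (500, 5) {
    for my $prefilter (0, 1) {
        open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
        my ($kept, $splits) = keep_matching($fh, $v, $prefilter);
        printf "V=%-3s prefilter=%d kept=%d splits=%d\n",
            $v, $prefilter, scalar @$kept, $splits;
    }
}
```

The kept count is the same with and without the pre-filter; only the number of splits changes. With a short pattern such as 5, most lines match somewhere, so nearly every line pays for the regex and then for the split as well, which is consistent with the slower timing above.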

With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.

"Science is about questioning the status quo. Questioning authority".

In the absence of evidence, opinion is indistinguishable from prejudice.

In Section Seekers of Perl Wisdom