PerlMonks  

Re^4: Optimise file line by line parsing, substitute SPLIT

by vsespb (Hermit)
on Jun 03, 2013 at 14:32 UTC (#1036773)


in reply to Re^3: Optimise file line by line parsing, substitute SPLIT
in thread Optimise file line by line parsing, substitute SPLIT

more quickly than you can read the file and do nothing

It does not have to be quicker, just comparable in time. 20%-30% is already significant.

Also, the idea that the whole application's run time (from start to finish) is what matters is a bit wrong.

Often it is the startup time (when the file is actually read) that matters; after startup the application is doing something useful (and can be blocked by disk/network IO or waiting for user action) until the system reboots.

Do you want me to paste code where split() takes more than 20% of the time when I just read a file into memory and skip some/most of the records?


Re^5: Optimise file line by line parsing, substitute SPLIT
by BrowserUk (Pope) on Jun 03, 2013 at 14:54 UTC
    Do you want me paste code where split() taking more {blah}

    I want you to post code -- directly comparable to the OP's -- where doing something takes longer than doing nothing.

    But, if you really want to play, show me code that filters a file of 2 million lines x 11 TAB-separated fields on the value of a field whose number and filter value I supply on the command line, more quickly than:

    #! perl -slw
    use strict;
    use Time::HiRes qw[ time ];

    our $FNO //= 6;
    our $V   //= 500;

    my $start = time;
    my @filtered;
    while( <> ) {
        my @fields = split( "\t", $_ );
        $fields[ $FNO ] == $V and push @filtered,$_;
    }

    printf "Took %f seconds\n", time() - $start;
    printf "Kept %u records\n", scalar @filtered;

    __END__
    C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
    Took 19.072147 seconds
    Kept 2005 records

    C:\test>1036737 -FNO=6 -V=500 < numbers.tsv
    Took 19.021369 seconds
    Kept 2005 records
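For anyone wanting to try the challenge above, a small test file of the same shape can be produced with something like the sketch below. The thread does not show how numbers.tsv was built, so the helper name `write_test_file`, the 0 .. 999 value range and the uniform distribution are all assumptions made for illustration; only the "11 TAB-separated fields per line" shape comes from the post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical generator (not from the thread): writes $rows lines of
# 11 TAB-separated integers. The 0 .. 999 range is an assumption; the
# real numbers.tsv may have been built quite differently.
sub write_test_file {
    my( $path, $rows ) = @_;
    open my $out, '>', $path or die "Cannot open $path: $!";
    for ( 1 .. $rows ) {
        print {$out} join( "\t", map { int rand 1000 } 1 .. 11 ), "\n";
    }
    close $out or die "Cannot close $path: $!";
    return $rows;
}

# Demo: write a small file to a temporary path (removed at exit).
my( $tmp_fh, $path ) = tempfile( SUFFIX => '.tsv', UNLINK => 1 );
close $tmp_fh;
write_test_file( $path, 1000 );
printf "Wrote %d rows to %s\n", 1000, $path;
```

Raising the row count to 2_000_000 and writing to a permanent path would give a file of the size discussed here, though timings will of course differ from machine to machine.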

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I thought your point was that the OP is actually doing nothing with the data (read = nothing, read + split = nothing too), and since he's going to read every word on every page soon, the split time will then be insignificant.

      But it seems that you mean that the OP's benchmark is incorrect, because he benchmarks nothing vs split.

      Otherwise I agree that split can't really be optimized, just like I wrote above.

        But it seems that you mean that OP benchmarks incorrect, because he benchmarks nothing vs split.

        No. As a measure of the time taken to do the splits, his benchmark is fine.

        What is wrong is his apparent expectation that locating 26 million tab characters, copying 28 million strings and making 28 million assignments would (or should) take less than the 8 seconds it does. 80 million fairly complex operations in 8 seconds is 1 every 10th of a microsecond. And that is pretty damn good.

        The only ways to reduce that amount of time are:

        • Overlap the IO and processing.

          8 - 1.3 = 6.7 seconds, assuming perfect overlap, which is pretty much impossible.

          200 * 9.3 = 1860 seconds versus 200 * 6.7 = 1340 seconds

          28% as a target; but achieving it would be very hard.

        • Run (some of) the 200+ processes in parallel.

          Doing 2 at a time would be a 50% gain. 4 at a time 75%.

          Much better targets and actually pretty close to achievable; but it requires careful programming to avoid disk thrash.

        • Do less work.

          Adding a single line to my code above:

          next unless /$V/;

          can give a 90% saving in some cases:

          C:\test>1036737 -V=500 < numbers.tsv
          Took 19.138550 seconds    ## without pre-filter
          Kept 2005 records

          C:\test>1036737 -V=500 < numbers.tsv
          Took 1.755853 seconds     ## with pre-filter
          Kept 2005 records

          But that saving is negated, and the result is actually worse, for less specific searches:

          C:\test>1036737 -V=5 < numbers.tsv
          Took 18.765492 seconds    ## without pre-filter
          Kept 1944 records

          C:\test>1036737 -V=5 < numbers.tsv
          Took 20.232294 seconds    ## with pre-filter
          Kept 1944 records
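Why the "do less work" pre-filter helps for -V=500 but not for -V=5 can be seen by counting how many split() calls it actually avoids. The following is only a sketch under stated assumptions: `filter_lines` is a hypothetical helper written for this illustration (it is not the benchmark script from the post), the sample lines are made up, and the fields are assumed to be plain decimal integers, since a substring test would miss values such as "5e2" that compare numerically equal to 500.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: keep lines whose field number $fno equals $v.
# With $prefilter set, lines that do not contain $v anywhere are skipped
# before split() is called. Returns the kept lines and the split() count.
sub filter_lines {
    my( $lines, $fno, $v, $prefilter ) = @_;
    my @kept;
    my $splits = 0;
    for my $line ( @$lines ) {
        # Same idea as `next unless /$V/;` in the thread.
        next if $prefilter and $line !~ /$v/;
        my @fields = split( "\t", $line );
        ++$splits;
        push @kept, $line if $fields[ $fno ] == $v;
    }
    return ( \@kept, $splits );
}

# Made-up sample: 10 lines of 11 fields; field 6 runs from 495 to 504,
# the other fields are the constants 1 .. 6 and 8 .. 11.
my @lines = map { join( "\t", 1 .. 6, $_, 8 .. 11 ) } 495 .. 504;

for my $v ( 500, 5 ) {
    for my $prefilter ( 0, 1 ) {
        my( $kept, $splits ) = filter_lines( \@lines, 6, $v, $prefilter );
        printf "V=%-3d prefilter=%d: %2d splits, %d kept\n",
            $v, $prefilter, $splits, scalar @$kept;
    }
}
```

In this sample only one line contains "500", so the pre-filter avoids nine of the ten splits; every line contains a "5" somewhere, so for V=5 the regex runs on every line and saves nothing, which is the same effect as the slower second timing above.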

