PerlMonks  

Re^8: selecting columns from a tab-separated-values file

by ibm1620 (Hermit)
on Jan 24, 2013 at 01:37 UTC


in reply to Re^7: selecting columns from a tab-separated-values file
in thread selecting columns from a tab-separated-values file

Interesting procedure! However, this passed the 10M-record file in 83 seconds, as opposed to 60 (to my great surprise).

UPDATE!!! Correction! I accidentally used perl 5.10 for the above test. I have been using 5.16 for everything else. Rerunning with 5.16 yielded a runtime of 60 seconds.

Re^9: selecting columns from a tab-separated-values file
by BrowserUk (Patriarch) on Jan 24, 2013 at 09:01 UTC
    Rerunning with 5.16 yielded a runtime of 60 seconds.

    Conclusion: With 384GB of RAM, your (relatively) tiny 10e6-line test file is being read from the system file cache, which effectively disguises the disk IO costs.

    If your 80GB file fits in cache and will always be there when you need to do this, you can ignore the effects of disk.

    Otherwise, you need to re-run all your tests using the real file, flushing the cache before each test.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      And/or there's some pretty effective read-ahead action going on with the disk driver.
        or there's some pretty effective read-ahead action going on with the disk driver

        Hm. I used XP personally and professionally for circa 10 years, and I never encountered a situation where the first run of a program reading a file wasn't substantially slower than the second run due to cache priming. (Excepting when the file in question was much bigger than the available cache memory, in which case the second run had to re-read the entire file from disk anyway.)

        There have been options (FILE_FLAG_RANDOM_ACCESS/FILE_FLAG_SEQUENTIAL_SCAN) in NTFS since its inception designed to give the OS clues as to the best caching strategy to use. But, a) in some fairly extensive testing I performed back in the day on XP, the use of these flags made little or no detectable difference; b) Perl doesn't use them.

        And the idea is easily disproved. Download CacheSet; start the program, hit the "Clear" button and confirm.

        Then run one of the tests twice in succession and record the run times. A pound to a penny says the first is substantially slower than the second.
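
A minimal sketch of that two-run comparison, assuming you clear the cache yourself (for example with CacheSet, as above) before starting it. The sub names time_read and compare_runs are made up for illustration and are not from any of the scripts in this thread; the sketch only does raw sequential reads, so it measures IO plus cache effects and nothing of the column-splitting work.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Read a file sequentially in 1 MiB chunks; return (elapsed seconds, bytes read).
sub time_read {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $start = time();
    my $bytes = 0;
    while ( my $n = read( $fh, my $buf, 1024 * 1024 ) ) {
        $bytes += $n;
    }
    close $fh;
    return ( time() - $start, $bytes );
}

# Make $runs consecutive passes over the same file.
# Returns a list of [seconds, bytes] pairs, one per pass.
sub compare_runs {
    my ( $path, $runs ) = @_;
    return map { [ time_read($path) ] } 1 .. $runs;
}

# Usage: perl cachetest.pl bigfile.tsv   (script name is hypothetical)
if (@ARGV) {
    my $pass = 0;
    for my $r ( compare_runs( $ARGV[0], 2 ) ) {
        printf "run %d: %.3f s (%d bytes)\n", ++$pass, @$r;
    }
}
```

If the first pass comes from disk and the second from the file cache, the first figure should be the larger one; if both are about equal, the file was probably already cached (or is too big to stay cached) when the first pass started.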


