RE: RE: RE (tilly) 2 (blame): File reading efficiency and other surly remarks

by lhoward (Vicar)
on Aug 26, 2000 at 17:41 UTC


in reply to RE: RE (tilly) 2 (blame): File reading efficiency and other surly remarks
in thread File reading efficiency and other surly remarks

I have done some benchmarking of "line at a time" vs. "chunk at a time with manual split into lines" vs. "line at a time with lots of buffering". "Chunk at a time with manual split into lines" is clearly the fastest, by almost 2 to 1 over the other two methods. I've included my benchmarking program and results below:
Benchmark: running BufferedFileHandle, chunk, linebyline, each for at least 3 CPU seconds...
BufferedFileHandle:  3 wallclock secs ( 3.22 usr + 0.08 sys = 3.30 CPU) @ 2.73/s (n=9)
             chunk:  4 wallclock secs ( 2.89 usr + 0.32 sys = 3.21 CPU) @ 4.36/s (n=14)
        linebyline:  4 wallclock secs ( 3.25 usr + 0.06 sys = 3.31 CPU) @ 2.72/s (n=9)
#!/usr/bin/perl

use Benchmark;
use strict;
use FileHandle;

timethese(0, {
    'linebyline'         => \&linebyline,
    'chunk'              => \&chunk,
    'BufferedFileHandle' => \&BufferedFileHandle,
});

# Read one line at a time with the diamond operator.
sub linebyline {
    open(FILE, "file");
    while (<FILE>) { }
    close(FILE);
}

# Read 64KB blocks and split them into lines by hand, carrying
# any partial last line over to the next block.
sub chunk {
    my ($buf, $leftover, @lines);
    open(FILE, "file");
    while (read FILE, $buf, 64*1024) {
        $buf = $leftover . $buf;
        @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) { }
    }
    close(FILE);
}

# Read one line at a time through a FileHandle with a 64KB buffer.
sub BufferedFileHandle {
    my $fh = new FileHandle;
    my $buffer_var;
    $fh->open("file");
    $fh->setvbuf($buffer_var, _IOLBF, 64*1024);
    while (<$fh>) { }
    $fh->close;
}
I'd be very interested to see your results that show differently.

Edit: replaced CODE tags with PRE tags around the long lines

Replies are listed 'Best First'.
RE (tilly) 5: File reading efficiency and other surly remarks
by tilly (Archbishop) on Aug 26, 2000 at 18:37 UTC
    While demonstrating that one point of mine was wrong (and again making it clear that until you benchmark, you don't really know what is faster), you demonstrate the other.

    What happens in your chunk code with the last line? Which is more code? And when you are done fixing that, you may still be twice as fast, but with quite a bit more (and harder to read) code. Going forward, that is more to maintain.

    I would strongly argue against this optimization (which I think might well give different results on different operating systems) until after your system is built and performance is known to be a problem.

    One note though. The IO* modules generally have significant overhead and I don't recommend using them.

    EDIT
    Another bug: you used split in the chunk method without the third (limit) argument, so trailing empty fields are silently dropped. Should your block end right where a new paragraph starts (i.e. with blank lines at the end of the chunk), you would incorrectly lose lines!
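    A minimal sketch of the chunk loop with both fixes applied (a -1 limit to split, plus a final flush of the leftover); the name chunk_fixed is just illustrative, and this version has not been benchmarked:

        sub chunk_fixed {
            my ($buf, @lines);
            my $leftover = '';
            open(FILE, "file") or die "open: $!";
            while (read FILE, $buf, 64*1024) {
                $buf = $leftover . $buf;
                # a limit of -1 keeps trailing empty fields, so blank
                # lines at the end of a block are not dropped
                @lines = split(/\n/, $buf, -1);
                # the last field is a partial line, or '' if the block
                # ended exactly on a newline
                $leftover = pop @lines;
                foreach (@lines) {
                    # process one complete line here
                }
            }
            if (length $leftover) {
                # the file's final line had no trailing newline;
                # process it here so it isn't lost
            }
            close(FILE);
        }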

      I never said that the method I posted was easier to maintain; I only stated that it was significantly more efficient. If fast reading of large files (ones that you can't fit into memory all at once) is your concern, then the block/hand-split method is better. Also, the code I used for the "block and manual split" approach is not my own, but lifted from an earlier PerlMonks discussion.
        Specifically see RE (tilly) 6 (bench): File reading efficiency and other surly remarks. Your speed claim can only be made for the specific setup you tested. If your code will need to run on multiple machines then the optimization is almost certainly wasted effort. If performance does not turn out to be a problem, it is likewise counterproductive to have sacrificed maintainability for this.

        In short, the fact that this might be faster is very good to know for the times that you need to squeeze performance out on one specific platform. But don't apply such optimizations until you know that you need to, and don't apply this one until you have benchmarked it against your target setup.

        A few general notes on optimization. Given the overhead of an interpreted language, shorter code is likely to be faster. With well modularized code you retain the ability to recognize algorithm improvements later - which is almost always a better win. Worrying about debuggability up front speeds development and gives more time to worry after the fact about performance. And readable code is easier to understand and optimize.

        Which all boils down to, don't prematurely optimize. Aim for good solid code that you are able to modify after you have enough of your project up and running that you can identify where the bottlenecks really turned out to be.

RE: RE: RE: RE (tilly) 2 (blame): File reading efficiency and other surly remarks
by tye (Sage) on Aug 26, 2000 at 19:57 UTC

    I'd be very interested to see your results that show differently.

    Cut and paste the script above, copy Chatter.bat (33KB) to "file", and run:

    Benchmark: running BufferedFileHandle, chunk, linebyline, each for at least 3 CPU seconds...
    BufferedFileHandle:  4 wallclock secs ( 3.46 usr + 0.00 sys = 3.46 CPU) @ 386.13/s (n=1336)
                 chunk:  4 wallclock secs ( 3.63 usr + 0.00 sys = 3.63 CPU) @ 310.19/s (n=1126)
            linebyline:  4 wallclock secs ( 3.40 usr + 0.00 sys = 3.40 CPU) @ 434.71/s (n=1478)

    This shows that default line-by-line is the fastest (434/s), enlarged buffer line-by-line is the 2nd fastest (386/s), and chunk and split is the slowest (310/s).

    Now append Chatter.bat to "file" until we have a 1MB file: buffered@15/s, line-by-line@13/s, chunk@9/s.

    Then, with an 85MB file: buffered@0.20/s, line-by-line@0.19/s, chunk@0.12/s.

    I'd personally consider perl broken if it couldn't read a line at a time faster than I could in Perl code. Previous benchmarks have shown that Perl's overriding of stdio buffers can make perl's I/O faster than I/O in C programs using stdio. So I must be missing something about (at least) your copy of perl to understand why standard line-by-line isn't faster.

    Update: I removed a pointless sentence that was probably snide. I apologize to those who already read it.

            - tye (but my friends call me "Tye")
      As I told lhoward, the result will be highly dependent upon many things. What OS you are on, what compiler you used, whether you compiled with Perl's I/O or your native one, so on and so forth. (ObRandomNote: Ilya used to moan about the fact that Perl was "pessimized" for I/O on Linux. OTOH Perl is still faster at virtually everything else...)

      I don't doubt for a second that he did that benchmark and got those numbers. I also don't doubt for a second that you did your benchmark and got your numbers as well. The lesson is that this kind of optimization can only be evaluated if you test against your actual target environment.

      But the advantages in maintainability simply cannot be disputed. In addition to the bugs I already pointed out, what happens if someone changes $/ and tries to figure out why nothing happened?
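      For illustration, a minimal sketch (the CRLF separator is just an example): the diamond operator follows whatever $/ is set to, while the chunk code keeps splitting on a hard-coded "\n", so setting $/ there appears to do nothing:

          {
              local $/ = "\r\n";            # e.g. CRLF-terminated records
              open(FILE, "file") or die "open: $!";
              while (<FILE>) {              # each $_ now ends in "\r\n"
                  chomp;                    # chomp strips whatever $/ is
              }
              close(FILE);
          }
          # The chunk reader, by contrast, still does split(/\n/, $buf)
          # regardless of $/.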

      Those are very interesting results. I have tested my code on several different OSes (Solaris and Linux) with several different versions of perl (5.6, 5.005, etc.), and the chunk method has always proven faster in my tests. What OS and version of perl did you test with?

        Win98, Perl 5.6.0 ("ActiveState Build 615").

        Do you have any theories about why the buffering in perl's internals isn't implemented as efficiently as your Perl code, especially considering the overhead involved in executing perl opcodes?

                - tye (but my friends call me "Tye")
