http://www.perlmonks.org?node_id=202936

cluka has asked for the wisdom of the Perl Monks concerning the following question:

Check out the code below (inspired by a comment from dws in a recent chatterbox discussion)
use Benchmark;

sub sub1 {
    my $fh;
    open $fh, 'vsfull.csv';
    binmode $fh;
    my @lines;
    my $block;
    my $left = '';
    while( read $fh, $block, 8192 ){
        $block = $left . $block;
        my $i = index $block, "\n";
        while($i > 0){
            push @lines, substr($block,0,$i);
            substr($block,0,$i+1,'');
            $i = index $block, "\n";
        }
        $left = $block;
    }
}

sub sub2 {
    my $fh;
    my @lines;
    open $fh, 'vsfull.csv';
    while(<$fh>){ push @lines, $_ }
}

timethese( 100, { readbig => \&sub1, whilelp => \&sub2 });
The results?
Benchmark: timing 100 iterations of readbig, whilelp...
   readbig:  25 wallclock secs ( 25.21 usr +  0.00 sys =  25.21 CPU) @  3.97/s (n=100)
   whilelp: 157 wallclock secs (156.71 usr +  0.00 sys = 156.71 CPU) @  0.64/s (n=100)
Now admittedly, my code is probably clunky and whatnot, but I would assume this is the model one would follow for splitting a file into lines. My question has two parts: why would one use <> when it's so slow relative to read, and why hasn't <> been implemented in such a fashion that it takes advantage of read's quickness?

Cluka

Re: Why use <$fh> at all?
by Anonymous Monk on Oct 05, 2002 at 00:53 UTC
    There is good prior discussion at File reading efficiency and other surly remarks. The short answer is that <> has to be slower because it does something more complicated. But on some platforms and versions of Perl it is unreasonably slow, and that has to do with the external stdio libraries it relies on.

      Benchmarking is a complex thing. See podmaster's node showing that the <> version is much faster for him and that a simple change makes it faster still. More on this later. (Oh, and your code is badly broken.)

      I stand by a previous statement of mine: I consider perl to be broken if it can't internally implement "read a block at a time and split into lines" faster than you can by writing Perl code. After all, if it can't, we'd be better off replacing the <> implementation with some external module implemented purely in Perl (which could then be converted to C since it all ends up in C anyway, and then optimized, etc.).

      But the fact is that Perl is broken. Perl went out of its way to make <> fast. It did so by doing some "interesting" tricks which meant that, in Perl, <> was sometimes faster than fgets() in C. The problem with this was that these tricks didn't work so well in all cases.

      It is a bit like what was written in the original node. Now you've got some complex, hard-to-maintain code that is a bit faster than some very simple, portable code. When things change (like Linux or PerlIO), suddenly you end up with big, complex code that is also slow. This describes what happened to Perl and it also describes what some are trying to do to "fix" it.

      If you need lines from a file, then use <>. If that ends up being uncomfortably slow, then you might want to look into doing it a different way, like our opening example. But don't "optimize" before you need to.

      And here are my results for the benchmarks. I noticed that you cheated by letting your big code use binmode (which saves the C libraries some translation work on some platforms), which means that your replacement code doesn't even give the same results. So I fixed that (which, on my platform, makes about a 20% difference in speed).
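      tye's fixed code isn't shown; a minimal sketch of the fairer comparison he describes (one way to level the field, assuming the same 'vsfull.csv' test file) is to give the <> version the same binmode call, so both subs skip the CRLF translation layer:

          # Sketch only: sub2 from the original node, with binmode added
          # so it does the same (lack of) translation work as sub1.
          sub sub2 {
              my $fh;
              open $fh, 'vsfull.csv';
              binmode $fh;
              my @lines;
              while(<$fh>){ push @lines, $_ }
          }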

      Then I checked for other bugs in your code. And this is why you don't get so obsessed about speed! Great, you have code that you think is a lot faster, but I've already found two bugs in it (make that 3, if there is no final newline). Put a lot more effort into getting the code correct and worry a lot less about how fast it runs.

      So I rewrote your block-at-a-time code because I thought I saw some places where I could make it faster. (:

      And this is when I found the fourth bug! And this was a big one, that completely invalidates the speed tests for the input file I was using.

      The reason your block-at-a-time code is so much faster is probably that it says $i > 0 instead of $i >= 0 which means that it manages to read a fraction of the total number of lines.
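      For clarity, the one-character fix being described (bart works through the consequences of the > 0 test further down): index returns 0 when a "\n" sits at position 0, i.e. an empty line, so the test needs to be >= 0.

          # The original inner loop with the off-by-one corrected.
          my $i = index $block, "\n";
          while($i >= 0){
              push @lines, substr($block,0,$i);
              substr($block,0,$i+1,'');
              $i = index $block, "\n";
          }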

      Sorry, I have to run now.

              - tye (but my friends call me join"",'T','y','e')
Re: Why use <$fh> at all?
by kelan (Deacon) on Oct 05, 2002 at 00:14 UTC
    I think at least one reason is that <> is line-oriented, in the sense that it scans the data for the next line separator and returns everything before it (well, since the last line separator). On the other hand, read is block-oriented. You tell it how big a block you want, and it reads in that many bytes. It doesn't look at or scan through the data like <> does. So it depends on how much structure you want. If you want the next line of text, <> does that for you, at the cost of a little speed. If you just want the next n-byte chunk, read is faster. You could try to implement <> with read, but what you'd end up doing is reading in some kind of "reasonable"-size chunk and scanning through it for the line separator, throwing the rest away or maybe needing to read the next chunk to find the end of the line. And that method doesn't really have any advantages over just using <> in the first place.

    kelan



      Actually, after I posted the above message I started work on a module that implements <> (via overloading) using read. It still beats the socks off of the traditional <> and gives the "line-oriented" feel back to the user. (The module essentially reads in an 8k block and feeds lines to the user until it needs to read another block...)
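      The module itself isn't posted; here is a minimal sketch of the same idea using a tied filehandle (the package name and all details are invented for illustration, and the real module may overload differently):

          package BufferedLines;

          # Sketch only: keep a buffer, refill it with 8k reads, and hand
          # back one line per call. Scalar-context reads only, for brevity.

          sub TIEHANDLE {
              my ($class, $path) = @_;
              open my $fh, '<', $path or die "open $path: $!";
              binmode $fh;
              return bless { fh => $fh, buf => '' }, $class;
          }

          sub READLINE {
              my $self = shift;
              my $i;
              # Refill until the buffer holds a full line or the file ends.
              while (($i = index $self->{buf}, "\n") < 0) {
                  read($self->{fh}, my $block, 8192) or last;
                  $self->{buf} .= $block;
              }
              if ($i >= 0) {
                  my $line = substr $self->{buf}, 0, $i + 1;
                  substr($self->{buf}, 0, $i + 1) = '';
                  return $line;
              }
              # Final line with no trailing newline, if any.
              if (length $self->{buf}) {
                  my $line = $self->{buf};
                  $self->{buf} = '';
                  return $line;
              }
              return undef;
          }

          package main;

          tie *FH, 'BufferedLines', 'vsfull.csv';
          while (defined(my $line = <FH>)) {
              # each $line arrives with its "\n", just like plain <>
          }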
Re: Why use <$fh> at all?
by Zaxo (Archbishop) on Oct 05, 2002 at 01:29 UTC

    Perl 5.8.0, w/PerlIO, Linux 2.4:

    $ ./diamond
    Benchmark: timing 100 iterations of readbig, whilelp...
       readbig: 10 wallclock secs ( 8.07 usr +  0.29 sys =  8.36 CPU) @ 11.96/s (n=100)
       whilelp:  5 wallclock secs ( 4.79 usr +  0.29 sys =  5.08 CPU) @ 19.69/s (n=100)
    $

    PerlIO certainly appears to be a big improvement. Several runs show a scatter of +/- 0.10/sec in the results. My test file was 8000 lines of 65 characters.

    After Compline,
    Zaxo

Re: Why use <$fh> at all?
by PodMaster (Abbot) on Oct 05, 2002 at 01:42 UTC
    What's this mean?
    sub sub1 {
        my $fh;
        open $fh, 'vsfull.csv';
        binmode $fh;
        my @lines;
        my $block;
        my $left = '';
        while( read $fh, $block, 8192 ){
            $block = $left . $block;
            my $i = index $block, "\n";
            while($i > 0){
                push @lines, substr($block,0,$i);
                substr($block,0,$i+1,'');
                $i = index $block, "\n";
            }
            $left = $block;
        }
    }

    sub sub2 {
        my $fh;
        my @lines;
        open $fh, 'vsfull.csv';
        while(<$fh>){ push @lines, $_ }
    }

    sub sub3 {
        my $fh;
        open $fh, 'vsfull.csv';
        my @lines = <$fh>;
    }

    sub sub4 {
        my $fh;
        open $fh, 'vsfull.csv';
        my @lines = readline($fh);
    }

    #Hmark::
    use Benchmark 'cmpthese';
    cmpthese( -3, {
        readbig => \&sub1,
        whilelp => \&sub2,
        better  => \&sub3,
        butter  => \&sub4,
    });

    __END__
    Benchmark: running better, butter, readbig, whilelp, each for at least 3 CPU seconds...
        better:  3 wallclock secs ( 2.20 usr + 1.02 sys = 3.22 CPU) @ 372.36/s (n=1199)
        butter:  3 wallclock secs ( 2.17 usr + 1.04 sys = 3.21 CPU) @ 373.52/s (n=1199)
       readbig:  3 wallclock secs ( 3.01 usr + 0.12 sys = 3.13 CPU) @  52.72/s (n=165)
       whilelp:  3 wallclock secs ( 2.17 usr + 1.02 sys = 3.19 CPU) @ 272.10/s (n=868)
               Rate readbig whilelp better butter
    readbig  52.7/s      --    -81%   -86%   -86%
    whilelp   272/s    416%      --   -27%   -27%
    better    372/s    606%     37%     --    -0%
    butter    374/s    609%     37%     0%     --

    ____________________________________________________
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: Why use <$fh> at all?
by sauoq (Abbot) on Oct 05, 2002 at 01:51 UTC
    My question has two parts: why would one use <> when it's so slow relative to read,

    Because efficiency isn't always the goal. If you wanted something to be really fast, you'd probably be better off choosing another language. <> is very convenient for writing straightforward and readable code.

    and why hasn't <> been implemented in such a fashion that it takes advantage of read's quickness?

    It actually has been, but read() isn't doing all that <> is.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Why use <$fh> at all? (bug fix)
by bart (Canon) on Oct 05, 2002 at 08:35 UTC
    Your code contains at least one serious bug:
    my $i = index $block, "\n";
    while($i > 0){
        push @lines, substr($block,0,$i);
        substr($block,0,$i+1,'');
        $i = index $block, "\n";
    }
    What if your data contains an empty line? At that point, index $block, "\n" will eventually return 0. From then on, the loop's condition will always be false, $left will grow until the end of the file, and that data will never be pushed onto the array @lines.

    What if the file data doesn't end with "\n"? Again, the last line is disregarded. At the least, push the contents of $left onto the array if it has nonzero length.

    The following code behaves better in both regards. I haven't benchmarked it, but it could be that your suspiciously excellent results are partly caused by the presence of an empty line.

    my $block;
    local $_ = '';
    while(my $n = read $fh, $block, 8192 ) {
        $_ .= $block;
        while((my $i = index $_, "\n") >= 0){
            push @lines, substr($_, 0, $i);
            substr($_, 0, $i+1, '');
        }
    }
    push @lines, $_ if length;
Re: Why use <$fh> at all?
by trs80 (Priest) on Oct 05, 2002 at 03:02 UTC
    It depends on the degree of efficiency required and the size of the data one is working with.
    When thinking of Perl in the mindset of the language's original goal, the <> method makes perfect sense (easy things easy). You read in a small text file and either report on it or make a small change. But when you look at reading in a 56MB log file that someone forgot to put newline characters in (that really happened), doing it with read makes more sense (hard things possible).
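    A hedged sketch of that 56MB scenario (filename and chunk handling invented): with no newlines in the file, <> would try to hand back the entire 56MB as a single "line", while read walks it in fixed-size pieces.

        # Invented illustration: scan a huge newline-less file in chunks.
        open my $fh, '<', 'huge.log' or die $!;
        binmode $fh;
        while (read $fh, my $chunk, 64 * 1024) {
            # process $chunk here; no line structure is assumed
        }
        close $fh;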
Re: Why use <$fh> at all?
by cluka (Sexton) on Oct 05, 2002 at 07:24 UTC
    In reply to Tye's comments... I appreciate the in-depth commentary. This is where I'm going to learn some real perl.

    To give some background that might clear up why I wrote the code the way I did: I'm running perl on a Win98 system, writing a piece of code that accepts a collection of text files from a legacy database, along with a configuration file, and generates a series of reports in MS Excel. The text files are comma-delimited and in a specific, reliable format (for example, no '\n' at the end of file), allowing for some of the assumptions I made. I tested the code above on several of these files and achieved the correct results each time. The reports are extremely time-sensitive and routinely sum to over 40MB of data, so I need the routines to be fast - although I take to heart Tye's comment that fast code is less important than correct code.

    One question I still have: what effect does binmode have on the data? In WinBlows it looks as though I still end up with text in my final array, regardless of whether I use binmode. In that case, switching to binmode and gaining the speed increase seems reasonable. Thanks.

      Did you read binmode? (Granted, it is rather inaccurate and shows that the author doesn't understand the point -- click on one of the links to more modern versions of the document for much better text.) It says "Files that are not in binary mode have CR LF sequences translated to LF on input", which is accurate for Win32 systems. Checking for such takes some time. Actually having to fix that requires that all of the text in the buffer after any CRs needs to be "moved up", which takes even more time. Checking the source for the standard Microsoft C run-time library, I see that the standard:

      char *in, *out;
      in = out = buffer;
      ...
      *out++ = *in++
      method is used to avoid multiple "move" operations, which means that the cost of moving is incurred even if no CRs are found [ and also makes for much simpler code that is easier to "get right" than if we tried to switch to strchr() and memmove() to allow assembly-language constructs to search the string and to move the bytes (: ].

              - tye (or ldad $54796500 if you're in a hurry)

      The difference binmode makes in DOS and Windows is crucial! Without binmode, all routines have to do (roughly speaking) two things: 1. did we just read a 0D 0A pair? (if yes, convert it to '\n'); 2. did we just read end of file? (if yes, stop). In binmode, all we care about is end-of-file. Even end-of-file can cause a problem if there is an embedded ^Z in the file (the original DOS end-of-file mark, which is ignored in binmode). And since these checks are implemented at root in the OS (with a thin wrapper in the 'C' library), the distinction is important...
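      A tiny demonstration of the difference on DOS/Windows (hypothetical file name; assume crlf.txt contains the bytes "hello" followed by 0D 0A):

          open my $txt, '<', 'crlf.txt' or die $!;
          my $text_line = <$txt>;   # "hello\n"   -- 0D 0A translated to \n
          close $txt;

          open my $bin, '<', 'crlf.txt' or die $!;
          binmode $bin;
          my $bin_line = <$bin>;    # "hello\r\n" -- bytes passed through untouched
          close $bin;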

      --hsm

      "Never try to teach a pig to sing...it wastes your time and it annoys the pig."
Re: Why use <$fh> at all?
by blogan (Monk) on Oct 05, 2002 at 15:55 UTC
    Do your test cases include lines that are 15000 characters each, or just lines that are 50 characters each? It's going to make a difference whether one read() gets 163 lines or you need two read()s to get one line.
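    One invented way to check that, using blogan's own figures (file names and counts made up): with 8192-byte reads, the short-line file packs roughly 160 lines into each read, while every 15000-character line of the other file spans two reads.

        open my $short, '>', 'short.csv' or die $!;
        print {$short} ('x' x 50) . "\n" for 1 .. 30_000;
        close $short;

        open my $long, '>', 'long.csv' or die $!;
        print {$long} ('x' x 15_000) . "\n" for 1 .. 100;
        close $long;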
      Excellent points! Several people have pointed out very egregious logic errors in the initial code. Some of them are not errors in practice, since I know the specifics of the files I am using (no blank lines, no '\n' at the end of the file).

      I really like blogan's comment. Does <$fh> hit the disk each time, or is it reading from a cached block? Does anyone know?

      I guess my initial point was flawed for the general case, but I can reformulate it to a better, stronger statement:

      If you know certain aspects of the files you are reading (e.g. average line size, whether there are blank lines, etc.), you could implement a bare-bones, lightning-fast read method that outruns the traditional <$fh>. But for a basic, system-independent file reader, <$fh> is a strong contender.

      Anyone agree?

        Update: fixed a typo and a factual error; I misread my own benchmark.

        As (was: if) you know your files are not too big to fit in memory and you really need the speed, then add this to your benchmark. It beats your code by 60% (was: 400%). Standard perl.

        sub sub3 {
            open FILE, 'yourfile' or die $!;
            binmode FILE;
            my @lines = split $/, do{ local $/; <FILE>; };
            close FILE or warn $!;
        }

        Cor! Like yer ring! ... HALO dammit! ... 'Ave it yer way! Hal-lo, Mister la-de-da. ... Like yer ring!