PerlMonks  

File reading efficiency and other surly remarks

by Hot Pastrami (Monk)
on Aug 26, 2000 at 04:50 UTC [id://29760]

Hot Pastrami has asked for the wisdom of the Perl Monks concerning the following question:

Er, hi.

It is true that I will weep with joy when the day arrives that I can answer more PerlMonk questions than I post. I'll gnash teeth, rend clothes, and some real Bible-style celebrating will go on. But alas, that day is not today. So...

I've seen benchmarks that indicate line-at-a-time file reading is not as efficient as slurping up the whole dad-gum thing into an array and stepping through it... depending on file size. Not hard to believe. So does anyone care to venture a guess (or happen to know) at approximately what file size method A will be more efficient (and quicker, naturally) than method B? Or am I dead wrong, and one method is ALWAYS more efficient?

The idea is to find a listing in a newline-delimited file, pretty standard stuff, really (forgive the sloppy example code):
# METHOD A: whole-file slurp
$him = "";
$enlightenment = "joy";
$trying = "oops";
open(TEST, "test.txt") or die $trying;
@fileContents = <TEST>;   # list context reads every line at once
close(TEST);
foreach (@fileContents) {
    chomp;
    my ($person, $wisdom) = split /:/, $_, 2;
    if ($wisdom eq $enlightenment) {
        $him = $person;
        last;
    }
}

# METHOD B: line-at-a-time
open(TEST, "test.txt") or die $trying;
while (<TEST>) {
    chomp;
    my ($person, $wisdom) = split /:/, $_, 2;
    if ($wisdom eq $enlightenment) {
        $him = $person;
        last;
    }
}
close(TEST);
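
To actually measure where the crossover falls on a given system, here is a rough benchmarking sketch using the core Benchmark module; the test.txt name and the "joy" search value are carried over from the example above, and the -2 (at least two CPU seconds per method) is an arbitrary choice.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $enlightenment = "joy";

sub slurp_then_scan {
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    my @lines = <$fh>;            # read every line into memory first
    close $fh;
    for (@lines) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        return $person if defined $wisdom && $wisdom eq $enlightenment;
    }
    return;
}

sub line_by_line {
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    while (<$fh>) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        if (defined $wisdom && $wisdom eq $enlightenment) {
            close $fh;
            return $person;
        }
    }
    close $fh;
    return;
}

cmpthese(-2, {
    slurp => \&slurp_then_scan,
    line  => \&line_by_line,
});

Running this against files of increasing size should show where, if anywhere, one method pulls ahead on a particular machine.
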
Thanks for any info you've got.

Alan "Hot Pastrami" Bellows
-Sitting calmly with scissors-

Replies are listed 'Best First'.
(Ovid - when *not* to optimize) Re: File reading efficiency and other surly remarks
by Ovid (Cardinal) on Aug 26, 2000 at 06:10 UTC
    One thing to keep in mind is that while striving for efficiency is important, it can waste time that would be more profitably spent elsewhere. If you have a script that takes twice as long to run as it could, what does it matter if it only runs once a month? I'm not saying that's the case here. It's just something I like to keep in mind.

    Oftimes, when I find that I can write a more efficient regex than one in a program I am working on, I ask myself two questions:

    1. Will it be harder to maintain?
    2. How necessary is optimizing the program?
    Understanding optimization is important, but it's also important to understand when to optimize.

    For example in one script, I had to write the following regex:

    $input =~ /^(?:[^_]|_(?!${value}))+_${value},((?:[A-Z]\d{1,2},?)+).*/;
    Because of the nature of the data, I could have written (untested):
    $input =~ /$value,((?:\w\d{1,2},?)+)/;
    The second regex may not be easy to understand, but it's a heck of a lot easier to understand than the first. If the script didn't require maximum efficiency, I would have chosen the second for maintainability.

    Cheers,
    Ovid

      How does the rant go?

      Don't prematurely optimize!
      Don't prematurely optimize!
      Don't prematurely optimize!

      Instead, modularize your code; then, after the fact, if it matters, you will be in a position to recognize the big-scale issues and redesign. Optimizing individual operations won't give you order-of-magnitude improvements. Fixing your algorithms will.

Re: File reading efficiency and other surly remarks
by tye (Sage) on Aug 26, 2000 at 07:07 UTC

    I think the most important difference between slurping an entire file and reading line-by-line is that it is pretty easy to have a file so large that slurping it will not just be slower than reading line-by-line; it will fail.

    Of somewhat lesser concern is that the line-by-line method scales linearly while the slurp method will start to slow down rather dramatically as the files start to get too large.

    So if you have an operation that can be done fairly efficiently in a line-by-line manner, I think you should almost always do it that way.

    If you are doing a small file, then the speed-up of slurp mode probably just isn't enough to make much difference. If you are doing large files, then you can't risk the possible huge slowdown or outright failure.

    If you are doing large files and need as much speed as possible, then you often read and write files chunk-by-chunk (which has the extra advantage of working even when you have a file containing a single line that is too large to fit in your available virtual memory space). This requires the use of read() or sysread() and possibly syswrite(). (A "chunk" is usually a fixed-length and fairly large buffer, like 64K).

    But this gets complicated quickly. No simple answer is going to cover all cases.
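
    For concreteness, here is a minimal sketch of the chunk-by-chunk approach described above, using sysread/syswrite with a 64K buffer; the file name and the copy-to-STDOUT destination are assumptions chosen purely for illustration.

    use strict;
    use warnings;

    # Copy a file to STDOUT in 64K chunks without ever holding more than
    # one buffer in memory; works even if the file has no newlines at all.
    my $chunk_size = 64 * 1024;

    open my $in, '<', 'test.txt' or die "Cannot open test.txt: $!";
    binmode $in;
    binmode STDOUT;

    my $buf;
    while (my $read = sysread $in, $buf, $chunk_size) {
        my $offset = 0;
        while ($offset < $read) {
            my $written = syswrite STDOUT, $buf, $read - $offset, $offset
                or die "syswrite failed: $!";
            $offset += $written;
        }
    }
    close $in;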

            - tye (but my friends call me "Tye")
Re: File reading efficiency and other surly remarks
by chromatic (Archbishop) on Aug 26, 2000 at 05:00 UTC
    As is the answer to so many benchmarking questions, the answer is "It Depends."

    I tend to process most files line-by-line if they have line-based data. If they have information that can span lines (or records), I slurp up the whole thing. Depending on file size, available memory, and other processes, slurping may not be a good idea.

    In this case, line by line seems like it would be more efficient. You can stop reading when you hit the record you want. (Of course, if you'll be doing this sort of thing often, I'd put everything in a database or at least a tied hash, and let something besides Perl handle the searching -- probably a little faster.)
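
    As a rough illustration of the tied-hash suggestion, here is a minimal sketch using the core DB_File module; the wisdom.db file name, the wisdom-to-person key direction, and the one-time load from test.txt are all assumptions for the sake of the example.

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Build a disk-based hash keyed by the "wisdom" field once, then let
    # DB_File handle lookups instead of scanning the flat file each time.
    tie my %person_for, 'DB_File', 'wisdom.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie wisdom.db: $!";

    # One-time load from the newline-delimited person:wisdom file.
    # (Note: if two people share the same wisdom, the last one loaded wins.)
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    while (<$fh>) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        $person_for{$wisdom} = $person if defined $wisdom;
    }
    close $fh;

    # Later lookups are a single fetch rather than a file scan.
    my $him = $person_for{'joy'};
    untie %person_for;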

Re: File reading efficiency and other surly remarks
by lhoward (Vicar) on Aug 26, 2000 at 05:10 UTC
    Reading a file all at once will always be faster than reading it one line at a time. The problem with the all-at-once approach is that if your file is large, it will consume a large amount of memory by loading the whole file into memory at once. If you want the efficiency of the all-at-once method without the memory problem, you can use the read/sysread functions to read from the file a block at a time. The only problem with this is that detecting line breaks isn't handled automatically for you. The code below is taken from an earlier perlmonks discussion about reading files a block at a time; this isn't my code, so I can't take credit (or blame) for it.
    open(FILE, "<file") or die "error opening $!";
    my $buf      = '';
    my $leftover = '';
    while (read FILE, $buf, 4096) {
        $buf = $leftover . $buf;
        my @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) {
            # process one line of data
        }
    }
    close(FILE);
    This example uses a read-block size of 4096 bytes. The optimal value will depend on your OS and filesystem's blocksize (among other things).
      Note: the open statement should include the filename in the die message on failure, as perlstyle recommends. Also, there are enough levels of buffering that I don't know that worrying about "optimal block size" really makes sense. And finally, just letting Perl worry about the line-by-line reading is probably faster and more reliable, IMO. It will do that buffering behind the scenes for you.

      OTOH I have used similar code when working with binary data. So the general technique is good to know.

        Good point. Since I also mentioned reading chunks at a time, I'll emphasize that this is not a good idea if you are going to split each chunk into lines.

        When you use Perl's <FILE>, Perl itself is reading the file in chunks and splitting them into lines for you. I can assure you that you can't do this faster in Perl code than Perl can do it itself. And the Perl code has been tested a lot more than any similar code you might write.

        Yes, Tilly already said all of this. I just didn't think he said it strongly enough (and I felt guilty for suggesting chunk-by-chunk after not fully understanding a previous reply).

                - tye (but my friends call me "Tye")
Re: File reading efficiency and other surly remarks
by Hot Pastrami (Monk) on Aug 26, 2000 at 10:53 UTC
    Thanks for the data, guys... you made some very good points. I guess that the primary considerations are that A) Checking line-by-line allows the loop to quit when it finds a match, so the whole file need not be read unless the match is on the LAST line, and B) I am working with files of unknown sizes - although it is unlikely that any would be extremely large, it is possible. I'll embrace the joy that is line-by-line.

    By the way, I know my example code sucked... the real code is much nicer, but too involved to include in its entirety.

    Alan "Hot Pastrami" Bellows
    -Sitting calmly with scissors-
