PerlMonks  

File reading efficiency and other surly remarks

by Hot Pastrami (Monk)
on Aug 26, 2000 at 04:50 UTC [id://29760]

Hot Pastrami has asked for the wisdom of the Perl Monks concerning the following question:

Er, hi.

It is true that I will weep with joy when the day arrives that I can answer more PerlMonk questions than I post. I'll gnash teeth, rend clothes, and some real Bible-style celebrating will go on. But alas, that day is not today. So...

I've seen benchmarks that indicate line-at-a-time file reading is not as efficient as slurping up the whole dad-gum thing into an array and stepping through it... depending on file size. Not hard to believe. So does anyone care to venture a guess (or happen to know) at approximately what file size method A will be more efficient (and quicker, naturally) than method B? Or am I dead wrong, and one method is ALWAYS more efficient?

The idea is to find a listing in a newline-delimited file, pretty standard stuff, really (forgive the sloppy example code):
# METHOD A: whole-file slurp
$him = "";
$enlightenment = "joy";
$trying = "oops";
open(TEST, "test.txt") or die $trying;
@fileContents = <TEST>;   # list context reads every line at once
close(TEST);
foreach (@fileContents) {
    chomp;
    my ($person, $wisdom) = split /:/, $_, 2;
    if ($wisdom eq $enlightenment) {
        $him = $person;
        last;
    }
}

# METHOD B: line-at-a-time
open(TEST, "test.txt") or die $trying;
while (<TEST>) {
    chomp;
    my ($person, $wisdom) = split /:/, $_, 2;
    if ($wisdom eq $enlightenment) {
        $him = $person;
        last;
    }
}
close(TEST);
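
To actually measure where the crossover falls on a given system, here is a rough benchmarking sketch using the core Benchmark module; the test.txt name and the "joy" search value are carried over from the example above, and the -2 (at least two CPU seconds per method) is an arbitrary choice.

use strict;
use warnings;
use Benchmark qw(cmpthese);

my $enlightenment = "joy";

sub slurp_then_scan {
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    my @lines = <$fh>;            # read every line into memory first
    close $fh;
    for (@lines) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        return $person if defined $wisdom && $wisdom eq $enlightenment;
    }
    return;
}

sub line_by_line {
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    while (<$fh>) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        if (defined $wisdom && $wisdom eq $enlightenment) {
            close $fh;
            return $person;
        }
    }
    close $fh;
    return;
}

cmpthese(-2, {
    slurp => \&slurp_then_scan,
    line  => \&line_by_line,
});

Running this against files of increasing size should show where, if anywhere, one method pulls ahead on a particular machine.
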
Thanks for any info you've got.

Alan "Hot Pastrami" Bellows
-Sitting calmly with scissors-

Replies are listed 'Best First'.
(Ovid - when *not* to optimize) Re: File reading efficiency and other surly remarks
by Ovid (Cardinal) on Aug 26, 2000 at 06:10 UTC
    One thing to keep in mind is that while striving for efficiency is important, it can waste time that would be more profitably spent elsewhere. If you have a script that takes twice as long to run as it could, what does it matter if it only runs once a month? I'm not saying that's the case here. It's just something I like to keep in mind.

    Oftimes, when I find that I can write a more efficient regex than one in a program I am working on, I ask myself two questions:

    1. Will it be harder to maintain?
    2. How necessary is optimizing the program?
    Understanding optimization is important, but it's also important to understand when to optimize.

    For example in one script, I had to write the following regex:

    $input =~ /^(?:[^_]|_(?!${value}))+_${value},((?:[A-Z]\d{1,2},?)+).*/;
    Because of the nature of the data, I could have written (untested):
    $input =~ /$value,((?:\w\d{1,2},?)+)/;
    The second regex may not be easy to understand, but it's a heck of a lot easier to understand than the first. If the script didn't require maximum efficiency, I would have chosen the second for maintainability.

    Cheers,
    Ovid

      How does the rant go?

      Don't prematurely optimize!
      Don't prematurely optimize!
      Don't prematurely optimize!

      Instead, modularize your code; then, after the fact, if it matters, you will be in a position to recognize the big-scale issues and redesign. Optimizing individual operations won't give you order-of-magnitude improvements. Fixing your algorithms will.

Re: File reading efficiency and other surly remarks
by tye (Sage) on Aug 26, 2000 at 07:07 UTC

    I think the most important difference between slurping an entire file and reading line-by-line is that it is pretty easy to have a file so large that slurping it will not just be slower than reading line-by-line; it will fail.

    Of somewhat lesser concern is that the line-by-line method scales linearly while the slurp method will start to slow down rather dramatically as the files start to get too large.

    So if you have an operation that can be done fairly efficiently in a line-by-line manner, I think you should almost always do it that way.

    If you are doing a small file, then the speed-up of slurp mode probably just isn't enough to make much difference. If you are doing large files, then you can't risk the possible huge slowdown or outright failure.

    If you are doing large files and need as much speed as possible, then you often read and write files chunk-by-chunk (which has the extra advantage of working even when you have a file containing a single line that is too large to fit in your available virtual memory space). This requires the use of read() or sysread() and possibly syswrite(). (A "chunk" is usually a fixed-length and fairly large buffer, like 64K).

    But this gets complicated quickly. No simple answer is going to cover all cases.
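
    For concreteness, here is a minimal sketch of the chunk-by-chunk approach described above, using sysread/syswrite with a 64K buffer; the file name and the copy-to-STDOUT destination are assumptions chosen purely for illustration.

    use strict;
    use warnings;

    # Copy a file to STDOUT in 64K chunks without ever holding more than
    # one buffer in memory; works even if the file has no newlines at all.
    my $chunk_size = 64 * 1024;

    open my $in, '<', 'test.txt' or die "Cannot open test.txt: $!";
    binmode $in;
    binmode STDOUT;

    my $buf;
    while (my $read = sysread $in, $buf, $chunk_size) {
        my $offset = 0;
        while ($offset < $read) {
            my $written = syswrite STDOUT, $buf, $read - $offset, $offset
                or die "syswrite failed: $!";
            $offset += $written;
        }
    }
    close $in;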

            - tye (but my friends call me "Tye")
Re: File reading efficiency and other surly remarks
by chromatic (Archbishop) on Aug 26, 2000 at 05:00 UTC
    As is the answer to so many benchmarking questions, the answer is "It Depends."

    I tend to process most files line-by-line if they have line-based data. If they have information that can span lines (or records), I slurp up the whole thing. Depending on file size, available memory, and other processes, slurping may not be a good idea.

    In this case, line by line seems like it would be more efficient. You can stop reading when you hit the record you want. (Of course, if you'll be doing this sort of thing often, I'd put everything in a database or at least a tied hash, and let something besides Perl handle the searching -- probably a little faster.)
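
    As a rough illustration of the tied-hash suggestion, here is a minimal sketch using the core DB_File module; the wisdom.db file name, the wisdom-to-person key direction, and the one-time load from test.txt are all assumptions for the sake of the example.

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Build a disk-based hash keyed by the "wisdom" field once, then let
    # DB_File handle lookups instead of scanning the flat file each time.
    tie my %person_for, 'DB_File', 'wisdom.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot tie wisdom.db: $!";

    # One-time load from the newline-delimited person:wisdom file.
    # (Note: if two people share the same wisdom, the last one loaded wins.)
    open my $fh, '<', 'test.txt' or die "Cannot open test.txt: $!";
    while (<$fh>) {
        chomp;
        my ($person, $wisdom) = split /:/, $_, 2;
        $person_for{$wisdom} = $person if defined $wisdom;
    }
    close $fh;

    # Later lookups are a single fetch rather than a file scan.
    my $him = $person_for{'joy'};
    untie %person_for;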

Re: File reading efficiency and other surly remarks
by lhoward (Vicar) on Aug 26, 2000 at 05:10 UTC
    Reading a file all at once will always be faster than reading it one line at a time. The problem with the all-at-once approach is that if your file is large, it will consume a large amount of memory by loading the whole file into memory at once. If you want the efficiency of the all-at-once method without the memory problem, you can use the read/sysread functions to read from the file a block at a time. The only problem with this is that detecting line breaks isn't handled automatically for you. The code below is taken from an earlier perlmonks discussion about reading files a block at a time; this isn't my code, so I can't take credit (or blame) for it.
    open(FILE, "<file") or die "error opening $!";
    my $buf      = '';
    my $leftover = '';
    while (read FILE, $buf, 4096) {
        $buf = $leftover . $buf;
        my @lines = split(/\n/, $buf);
        $leftover = ($buf !~ /\n$/) ? pop @lines : "";
        foreach (@lines) {
            # process one line of data
        }
    }
    close(FILE);
    This example uses a read-block size of 4096 bytes. The optimal value will depend on your OS and filesystem's blocksize (among other things).
      Note: the open statement should include the filename in the die message on failure, as perlstyle recommends. Also, there are enough levels of buffering that I don't know that worrying about "optimal block size" really makes sense. And finally, just letting Perl worry about the line-by-line reading is probably faster and more reliable, IMO. It will do that buffering behind the scenes for you.

      OTOH I have used similar code when working with binary data. So the general technique is good to know.

        Good point. Since I also mentioned reading chunks at a time, I'll emphasize that this is not a good idea if you are going to split each chunk into lines.

        When you use Perl's <FILE>, Perl itself is reading the file in chunks and splitting them into lines for you. I can assure you that you can't do this faster in Perl code than Perl can do it itself. And the Perl code has been tested a lot more than any similar code you might write.

        Yes, Tilly already said all of this. I just didn't think he said it strongly enough (and I felt guilty for suggesting chunk-by-chunk after not fully understanding a previous reply).

                - tye (but my friends call me "Tye")
Re: File reading efficiency and other surly remarks
by Hot Pastrami (Monk) on Aug 26, 2000 at 10:53 UTC
    Thanks for the data, guys... you made some very good points. I guess that the primary considerations are that A) Checking line-by-line allows the loop to quit when it finds a match, so the whole file need not be read unless the match is on the LAST line, and B) I am working with files of unknown sizes - although it is unlikely that any would be extremely large, it is possible. I'll embrace the joy that is line-by-line.

    By the way, I know my example code sucked... the real code is much nicer, but too involved to include in its entirety.

    Alan "Hot Pastrami" Bellows
    -Sitting calmly with scissors-
