dorpus has asked for the wisdom of the Perl Monks concerning the following question:
Hi, should there be any difference in performance between the following commands?
open(INFILE,"cat textfile |")
while(<INFILE>) {...}
open(INFILE,"textfile")
while(<INFILE>) {...}
system("cat textfile | filter.pl")
Re: cat vs. file handle speed?
by Adam (Vicar) on Mar 30, 2001 at 06:06 UTC
|
Think about the overhead involved. Here is my brief analysis:
open(INFILE,"cat textfile |")
while(<INFILE>) {...}
This opens a type of file handle commonly known as a pipe.
It spawns an additional process, complete with a duplicate set of environment variables and memory management requirements. The OS must now context-switch between Perl and cat and copy the data through the pipe.
open(INFILE,"textfile")
while(<INFILE>) {...}
Perl opens a file handle directly to the file. No other processes are started.
system("cat textfile | filter.pl")
Perl invokes the shell, which invokes cat and another instance of Perl! Plus the shell still has to open a file handle for the output of cat / the input to filter.pl.
Result: All three methods require a filehandle (aka a fileno, or a file descriptor), and two of the methods have the additional overhead of multiple processes. Use the second method and avoid all that.
Re: cat vs. file handle speed?
by the_slycer (Chaplain) on Mar 30, 2001 at 09:47 UTC
|
Well, someone had to do it, right?
I haven't played with benchmarking much, but here is my contribution..
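use Benchmark;

timethese(50000, {
    'OPENCAT' => sub {
        # Read the file through a pipe from cat.
        open(INFILE, "cat mbox |") or die "Cannot run cat: $!";
        while (<INFILE>) {
            # do nothing
        }
        close INFILE;
    },
    'OPENPERL' => sub {
        # Read the file directly.
        open(INFILE, "mbox") or die "Cannot open mbox: $!";
        while (<INFILE>) {
            # do nothing
        }
        close INFILE;
    },
});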
Results:
Benchmark: timing 50000 iterations of OPENCAT, OPENPERL...
OPENCAT: 287 wallclock secs (175.33 usr 19.89 sys + 46.32 cusr 44.78 csys = 286.32 CPU) @ 256.12/s (n=50000)
OPENPERL: 171 wallclock secs (168.59 usr + 2.50 sys = 171.09 CPU) @ 292.24/s (n=50000)
I dropped "system" out of it early on, because it had already reached about the above
times after only 1000 iterations :-)
|
Now explain your benchmark. :-)
When I answered before I knew full well that any of the
three could win, depending on OS, installed versions,
hardware, files, etc. The reason why cat can win is
latency. In doing IO, every so often you may wind
up waiting for your request to be serviced. Well, with the
pipe you can let cat do that waiting, and Perl can go on its
merry way.
This has to be weighed against the fact that it takes more
work to launch cat than it does to open a filehandle. Plus
operating systems take some pains to do for every process
what cat does for one. So the tradeoff is highly system
specific.
The third option, slowest for you by a country mile, can
win on very large files. Why? Well it turns out that
Perl is faster to read STDIN than arbitrary filehandles.
The third option arranges for Perl to be using STDIN. This
has to be weighed against the fact that it takes a lot more
work for Perl to be launched than cat.
Therefore in the right time and place, any of the three can
win on raw speed.
But you should definitely go with the second. No doubt
about it.
Why, you ask?
Well it is the most portable answer, and with the second
you can check failures and $! is populated correctly. This
key information has been lost for the other 2. Besides
which if you really ran out of performance, by using the
second and then naively parallelizing by running a fixed
number of copies on different files, you would get the
best overall throughput.
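A minimal sketch of the second method with the error check in place, to show where $! comes into play. The helper name count_lines and the throwaway file made with File::Temp are illustrative assumptions, not something from the original question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count the lines in a file using a direct open.
sub count_lines {
    my ($path) = @_;
    # On failure open returns false and $! holds the OS error.
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    my $count = 0;
    $count++ while <$fh>;
    close($fh) or die "Cannot close $path: $!\n";
    return $count;
}

# Demo on a throwaway file so the sketch runs anywhere.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 3;
close($tmp_fh) or die "Cannot close $tmp_name: $!\n";

print count_lines($tmp_name), "\n";

# A missing file yields a message carrying $! instead of a silent empty loop.
my $ok = eval { count_lines("$tmp_name.missing"); 1 };
print "error: $@" unless $ok;
```

With the piped or system variants the same failure would only show up as text on cat's STDERR and a non-zero exit status.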
There is exactly one circumstance where I have, or would,
recommend something different. If you are on a system
where Perl does not have large file support but cat does
(this is now a compile-time option for Perl, but some
systems may still fit that description) then the first
option will allow Perl to work on files of size over
2 GB.
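If the cat workaround is ever needed, it can still carry some error checking. This is a sketch only: the helper name count_lines_via_cat is made up for illustration, and it assumes a Perl new enough (5.8 or later) for the list form of a piped open and a cat on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count lines by reading the output of cat.
sub count_lines_via_cat {
    my ($path) = @_;
    # List form of a piped open: no shell is involved, so odd
    # characters in $path are not interpreted.
    open(my $fh, '-|', 'cat', $path) or die "Cannot start cat: $!\n";
    my $count = 0;
    $count++ while <$fh>;
    # cat's own failures (missing file, permissions) only surface here,
    # as a non-zero exit status in $?, not as a useful $!.
    close($fh)
        or die "cat failed on $path (exit status " . ($? >> 8) . ")\n";
    return $count;
}

# Demo on a throwaway file.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 3;
close($tmp_fh) or die "Cannot close $tmp_name: $!\n";

print count_lines_via_cat($tmp_name), "\n";
```

Note that the check on close is what catches a missing file here; the open itself succeeds as long as cat can be started.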
So the summary is that any of the three can win on raw
performance, but for portability and error checking you
really want to use the native method. (Which is the
prioritization that I hinted at above. But you should
not need to know all of this; that prioritization is
usually right in the end.)
Any questions?
Re: cat vs. file handle speed?
by extremely (Priest) on Mar 30, 2001 at 05:48 UTC
|
Re: cat vs. file handle speed?
by petral (Curate) on Mar 30, 2001 at 06:01 UTC
|
Has anyone tried using Mmap?
  (since we're on the subject of file-reading speed)
p
Re: cat vs. file handle speed?
by Malkavian (Friar) on Mar 30, 2001 at 15:49 UTC
|
Quick answer:
In an ideal world (and most cases), Perl will beat cat in a file read.
Caveat:
If you're running on Linux, this isn't the case, and cat is actually faster. See the enlightened node by tye on this subject here. A minor workaround (read: ugly hack) to get Linux to work faster was to use a read statement and break the block down into lines using a reader object. This seems to work for Linux, but will seriously slow down other OSes.
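The block-read idea can be sketched roughly as follows. This is not the reader object from the node mentioned above; the helper name each_line_blockwise, the callback interface and the 64 KB default block size are all assumptions made for illustration. Whether it beats a plain while (<FH>) loop depends on the platform and Perl build, so benchmark before adopting it.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a file in fixed-size blocks and pass
# each line to a callback.
sub each_line_blockwise {
    my ($path, $callback, $block_size) = @_;
    $block_size ||= 64 * 1024;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    my $pending = '';
    while (1) {
        my $got = read($fh, my $chunk, $block_size);
        die "Read error on $path: $!\n" unless defined $got;
        last if $got == 0;    # end of file
        $pending .= $chunk;
        # Hand over every complete line; a partial tail stays in $pending.
        while ($pending =~ s/\A([^\n]*\n)//) {
            $callback->($1);
        }
    }
    # The last line may lack a trailing newline.
    $callback->($pending) if length $pending;
    close($fh) or die "Cannot close $path: $!\n";
    return;
}

# Demo: a tiny block size forces lines to straddle block boundaries.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "alpha\nbeta\ngamma";
close($tmp_fh) or die "Cannot close $tmp_name: $!\n";

my @lines;
each_line_blockwise($tmp_name, sub { push @lines, $_[0] }, 4);
print scalar(@lines), " lines read\n";
```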
Malk
|
Well, my analysis applies if cat uses "stdio.h" to read the file (which probably depends on the breed of cat that you have).
But that doesn't matter in this case because even if cat is faster than Perl, Perl would still have to read the output from cat. So X+Y is always bigger than just Y (since a process can't consume negative resources), whether X<Y or Y<X.
-
tye
(but my friends call me "Tye")
Re (tilly) 1: cat vs. file handle speed?
by tilly (Archbishop) on Mar 30, 2001 at 06:02 UTC
|
2/3 of these are syntax errors.
None of them have error checks.
Those issues are more important than the minuscule speed
differences...
UPDATE
(Response to Adam.)
I find that what people do in pseudo-code, they do in
real code as well. Error checks should be a reflex.
|
I think exceptions can be made for pseudo-code, Tilly. Focus on the question.
Update
Tilly says, "I find that what people do in pseudo-code, they do in real code as well. Error checks should be a reflex."
I completely disagree. Since I never use open without an or die, I never bother with it in pseudo-code. Pseudo-code is specifically for reducing the algorithm to its core piece. Your later argument that two of the methods lost $! is a good one, but your argument that the examples presented contain no error checking is irrelevant. More to the point, your initial statement, "2/3 of these are syntax errors," is also out of place and inappropriate. You took an honest question and made two pointless statements as an attempt to say, "dorpus, you are asking the wrong question." I worry about that kind of response as it does nothing to assist our fellow monks and undermines your own credibility as one of our more learned and informed monks. I've enjoyed our discussions here, and I think you have much to contribute. But your response here was wrong and I have no qualms about pointing that out.
|
I suspect that we will disagree on this then. My point was
that if you are at a point where you do not - by reflex -
get the syntax right and put the error checks in, then the
optimization question is not where you need to concentrate.
And, of course, if you do try to fix those items
then you should quickly discover my real point, which
is that error reporting with 2 of the solutions is made
much harder. That fact is one of the key reasons why it
is a bad idea to write Perl as a glorified shell.
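To make the point about error reporting concrete, here is a sketch of what checking a system call involves. The helper name run_and_report is hypothetical, and the running perl binary ($^X) is used purely as a stand-in for an external command such as cat.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command and describe how it ended.
sub run_and_report {
    my @cmd = @_;
    my $rc = system(@cmd);
    if ($rc == -1) {
        # Only a failure to launch the command sets $!.
        return "failed to start: $!";
    }
    elsif ($? & 127) {
        return "killed by signal " . ($? & 127);
    }
    elsif ($? >> 8) {
        # The reason for the failure went to the child's STDERR;
        # all Perl gets back is a number.
        return "exited with status " . ($? >> 8);
    }
    return "ok";
}

print run_and_report($^X, '-e', 'exit 0'), "\n";
print run_and_report($^X, '-e', 'exit 3'), "\n";
```

Compare that with a direct open, where a single "or die" with $! already says what went wrong.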
As for whether $! was a later argument, well I don't think
so. You see I am in the habit of giving answers where you
are unlikely to see the point of the answer unless you try
it. If you try it you will discover that for yourself,
and I believe that makes it stick better. If you do not
try it, well my typing it wouldn't have helped because you
would have just forgotten that as well.
So yes, I was attempting to make it clear that dorpus
was asking the wrong question. That was not an accident,
that was the point. And I think that what I said does
help our fellow monks. Why? Because it tells them what
I think is important. I believe that if they value what
I think is important here, that will be helpful. It may
not be the help that was requested, but I am (in case you
had not noticed) someone who tries to give the help that
I think does the most good, even if it is not the help that
was asked for.
Furthermore, when I first answered I gave conscious thought
to the question of whether I should answer the
question as posed. You see I knew from the start that
any of the three could beat the other two in practice.
I sincerely thought about saying that up front, but I
decided that it would obscure the critical point.
And the critical point is that 99.9% of the time this is
the wrong question to ask. The remaining 0.1% of
the time, if you ask it and think carefully, you will
come to the same answer that you would have come to if
you had asked the right question in the first place.
Therefore I thought it justified to only say what I
considered to be key. Which is that until you
reflexively get the syntax right and reach for the
error check, it is more important to focus on those things
than worrying about raw performance.
Now if this undermines my credibility, then so be it.
And continuing on, you may be different, but I used to
claim that I never opened without a die in real code. But
first I realized that if I showed my pseudo-code to others
I wanted to put the die in so that they would not
accidentally copy that. Then one day I caught myself
missing that detail converting my own pseudo-code
into production code. I then sat back, thought, and made
the conscious decision to always use it, even in pseudo-code,
because I didn't want to accidentally pick up and
use bad habits moving to production code.
So YMMV, but what I do in pseudo-code I tend to do in
production as well. So habits I want to have in my
production code I try to stick to in pseudo-code.
|
Re: cat vs. file handle speed?
by indigo (Scribe) on Mar 30, 2001 at 05:55 UTC
|
Think about it... if the cat version was faster, why would the Perl authors bother to write a vanilla version that was slower?