PerlMonks  

cat vs. file handle speed?

by dorpus (Novice)
on Mar 30, 2001 at 05:19 UTC ( [id://68255]=perlquestion )

dorpus has asked for the wisdom of the Perl Monks concerning the following question:

Hi, should there be any difference in performance between the following commands?
open(INFILE,"cat textfile |") while(<INFILE>) {...}
open(INFILE,"textfile") while(<INFILE>) {...}
system("cat textfile | filter.pl")

Replies are listed 'Best First'.
Re: cat vs. file handle speed?
by Adam (Vicar) on Mar 30, 2001 at 06:06 UTC
    Think about the overhead involved. Here is my brief analysis:
    open(INFILE,"cat textfile |") while(<INFILE>) {...}
    This opens a type of file handle commonly known as a pipe. It spawns an additional process, complete with a duplicate set of environment variables and memory management requirements. The OS must now swap memory back and forth between Perl and cat.
    open(INFILE,"textfile") while(<INFILE>) {...}
    Perl opens a file handle directly to the file. No other processes are started.
    system("cat textfile | filter.pl")
    Perl invokes the shell, which invokes cat and another instance of Perl! Plus the shell still has to open a file handle for the output of cat / the input to filter.pl.

    Result: All three methods require a filehandle (aka a fileno, or a file descriptor) and two of the methods have the additional overhead of multiple processes. Use the second method and avoid all that.
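
A minimal sketch of the second method written out in full, for comparison with the fragments above. The helper name count_lines is made up for illustration, and the three-argument open with a lexical filehandle is just one common way to spell it, not something taken from the question.

```perl
use strict;
use warnings;

# count_lines() is a hypothetical helper name, not something from the thread.
# It reads a file through a plain filehandle, with no extra process involved.
sub count_lines {
    my ($path) = @_;

    # Three-argument open with a lexical filehandle; $! says why open failed.
    open(my $in, '<', $path) or die "Cannot open $path: $!";

    my $count = 0;
    while (my $line = <$in>) {
        $count++;
    }

    close($in) or die "Cannot close $path: $!";
    return $count;
}

# Usage: perl count.pl textfile
print count_lines($ARGV[0]), "\n" if @ARGV;
```

Nothing here starts a shell or a second process, which is the overhead the other two methods add.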

Re: cat vs. file handle speed?
by the_slycer (Chaplain) on Mar 30, 2001 at 09:47 UTC
    Well, someone had to do it, right?
    I haven't played with benchmarking much, but here is my contribution..
    use Benchmark;

    timethese (50000, {
        'OPENCAT' => sub {
            open (INFILE, "cat mbox |");
            while (<INFILE>){
                #do nothing
            }
            close INFILE;
        },
        'OPENPERL' => sub {
            open (INFILE, "mbox");
            while (<INFILE>){
                #do nothing
            }
            close INFILE;
        }
    });
    Results:

    Benchmark: timing 50000 iterations of OPENCAT, OPENPERL...
    OPENCAT: 287 wallclock secs (175.33 usr 19.89 sys + 46.32 cusr 44.78 csys = 286.32 CPU) @ 256.12/s (n=50000)
    OPENPERL: 171 wallclock secs (168.59 usr + 2.50 sys = 171.09 CPU) @ 292.24/s (n=50000)

    I dropped "system" out of it early on - due to the fact that it was at about the above levels after only 1000 iterations :-)
      Now explain your benchmark. :-)

      When I answered before I knew full well that any of the three could win, depending on OS, installed versions, hardware, files, etc. The reason why cat wins here is latency. In doing IO, every so often you may wind up waiting for your request to get sorted. Well with the pipe you can let cat do that waiting, and Perl can go on its merry way.

      This has to be weighed against the fact that it takes more work to launch cat than it does to open a filehandle. Plus operating systems take some pains to do for every process what cat does for one. So the tradeoff is highly system specific.

      The third option, slowest for you by a country mile, can win on very large files. Why? Well it turns out that Perl is faster to read STDIN than arbitrary filehandles. The third option arranges for Perl to be using STDIN. This has to be weighed against the fact that it takes a lot more work for Perl to be launched than cat.

      Therefore in the right time and place, any of the three can win on raw speed.

      But you should definitely go with the second. No doubt about it.

      Why, you ask?

      Well it is the most portable answer, and with the second you can check failures and $! is populated correctly. This key information has been lost for the other 2. Besides which if you really ran out of performance, by using the second and then naively parallelizing by running a fixed number of copies on different files, you would get the best overall throughput.
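
A rough sketch of the error-reporting difference, assuming a Unix-like system with cat on the PATH. The helper names try_direct and try_cat are hypothetical. With a direct open, a missing file shows up immediately at open with the reason in $!. Through cat, the open itself normally succeeds, and the only trace of the problem is cat's exit status in $? once the pipe is closed.

```perl
use strict;
use warnings;

# Hypothetical helper: open the file directly and report what Perl knows.
sub try_direct {
    my ($path) = @_;
    open(my $in, '<', $path) or return "open failed: $!";
    close($in);
    return "ok";
}

# Hypothetical helper: read the same file through cat (list-form pipe open,
# so no shell is involved). Assumes a Unix-like system with cat installed.
sub try_cat {
    my ($path) = @_;
    open(my $in, '-|', 'cat', $path) or return "could not start cat: $!";

    # Drain whatever cat writes to its standard output.
    1 while <$in>;

    # For a pipe, close() waits for the child and returns false
    # if the child exited with a non-zero status, which lands in $?.
    if (close($in)) {
        return "ok";
    }
    return "cat exited with status " . ($? >> 8);
}

# Usage: perl errcheck.pl /some/path
if (@ARGV) {
    print "direct: ", try_direct($ARGV[0]), "\n";
    print "cat:    ", try_cat($ARGV[0]), "\n";
}
```

For a missing file, the direct version can say why the open failed, while the cat version only knows that cat was unhappy; the actual reason went to cat's standard error.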

      There is exactly one circumstance where I have, or would, recommend something different. If you are on a system where Perl does not have large file support but cat does (this is now a compile-time option for Perl, but some systems may still fit that description) then the first option will allow Perl to work on files of size over 2 GB.
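
One way to check whether a given perl was built with large file support is to look at the build configuration through the core Config module. This is only a sketch: the helper name has_large_file_support is made up, and it assumes the uselargefiles and lseeksize entries are present in your perl's configuration.

```perl
use strict;
use warnings;
use Config;

# Hypothetical helper: returns 1 if this perl was configured with
# large file support (uselargefiles is 'define'), 0 otherwise.
sub has_large_file_support {
    my $flag = $Config{uselargefiles};
    return (defined $flag && $flag eq 'define') ? 1 : 0;
}

# lseeksize is the size of a file offset in bytes, as recorded at build time.
printf "large file support: %s, file offset size: %s bytes\n",
    (has_large_file_support() ? "yes" : "no"),
    (defined $Config{lseeksize} ? $Config{lseeksize} : "unknown");
```

The same information is available from the command line with perl -V:uselargefiles.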

      So the summary is that any of the three can win on raw performance, but for portability and error checking you really want to use the native method. (Which is the prioritization that I hinted at above. But you should not need to know all of this, that prioritization is usually right in the end.)

      Any questions?

Re: cat vs. file handle speed?
by extremely (Priest) on Mar 30, 2001 at 05:48 UTC
    Short answer? Yes. Opening a pipe and spawning `cat' isn't ever a win...

    --
    $you = new YOU;
    honk() if $you->love(perl)

Re: cat vs. file handle speed?
by petral (Curate) on Mar 30, 2001 at 06:01 UTC
    Has anyone tried using Mmap?
      (since we're on the subject of file-reading speed)

    p
Re: cat vs. file handle speed?
by Malkavian (Friar) on Mar 30, 2001 at 15:49 UTC
    Quick answer:
    In an ideal world (and most cases), Perl will beat cat in a file read.
    Caveat:
    If you're running on Linux, this isn't the case, and cat is actually faster. See the enlightened node by tye on this subject here.
    A minor work around (read ugly hack) to get Linux to work faster was to use a read statement, and break down the block into lines using a reader object. Seems to work for Linux, but will seriously slow down other OSes.
    Malk
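
A minimal sketch of the kind of workaround described above: read fixed-size blocks with read and split them into lines by hand. This is not the reader object from the original node. The helper name each_line_blockwise, the callback interface, and the 64 KB default block size are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $callback once per line of $path, reading the
# file in blocks of $block_size bytes instead of using <$fh>.
sub each_line_blockwise {
    my ($path, $callback, $block_size) = @_;
    $block_size ||= 64 * 1024;    # assumed default, tune for your system

    open(my $in, '<', $path) or die "Cannot open $path: $!";

    my $pending = '';             # data read so far that has no newline yet
    while (1) {
        my $got = read($in, my $buf, $block_size);
        die "Read error on $path: $!" unless defined $got;
        last if $got == 0;        # end of file

        $pending .= $buf;

        # Everything up to the last newline consists of complete lines.
        my $cut = rindex($pending, "\n");
        next if $cut < 0;

        # Four-argument substr removes the complete lines from $pending
        # and returns them; split /^/ keeps each line's trailing newline.
        my $chunk = substr($pending, 0, $cut + 1, '');
        $callback->($_) for split /^/, $chunk;
    }

    # A final line without a trailing newline is still a line.
    $callback->($pending) if length $pending;

    close($in) or die "Cannot close $path: $!";
    return;
}

# Usage: perl blockread.pl textfile
each_line_blockwise($ARGV[0], sub { print $_[0] }) if @ARGV;
```

Whether this beats a plain while (<$fh>) loop depends on the platform and the perl build, so it is worth benchmarking before adopting it.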

      Well, my analysis applies if cat uses "stdio.h" to read the file (which probably depends on the breed of cat that you have).

      But that doesn't matter in this case because even if cat is faster than Perl, Perl would still have to read the output from cat. So X+Y is always bigger than just Y (since a process can't consume negative resources), whether X<Y or Y<X.

              - tye (but my friends call me "Tye")
Re (tilly) 1: cat vs. file handle speed?
by tilly (Archbishop) on Mar 30, 2001 at 06:02 UTC
    2/3 of these are syntax errors.

    None of them have error checks.

    Those issues are more important than the minuscule speed differences...

    UPDATE
    (Response to Adam.)
    I find that what people do in pseudo-code, they do in real code as well. Error checks should be a reflex.

      I think exceptions can be made for pseudo-code, Tilly. Focus on the question.

      Update
      Tilly says, "I find that what people do in pseudo-code, they do in real code as well. Error checks should be a reflex."
      I completely disagree. Since I never use open without an or die, I never bother with it in pseudo-code. Pseudo-code is specifically for reducing the algorithm to its core piece. Your later argument that two of the methods lost $! is a good one, but your argument that the examples presented contain no error checking is irrelevant. More to the point, your initial statement, "2/3 of these are syntax errors," is also out of place and inappropriate. You took an honest question and made two pointless statements as an attempt to say, "dorpus, you are asking the wrong question." I worry about that kind of response as it does nothing to assist our fellow monks and undermines your own credibility as one of our more learned and informed monks. I've enjoyed our discussions here, and I think you have much to contribute. But your response here was wrong and I have no qualms about pointing that out.

        I suspect that we will disagree on this then. My point was that if you are at a point where you do not - by reflex - get the syntax right and put the error checks in, then the optimization question is not where you need to concentrate. And, of course, if you do try to fix those items then you should quickly discover my real point, which is that error reporting with 2 of the solutions is made much harder. That fact is one of the key reasons why it is a bad idea to write Perl as a glorified shell.

        As for whether $! was a later argument, well I don't think so. You see I am in the habit of giving answers where you are unlikely to see the point of the answer unless you try it. If you try it you will discover that for yourself, and I believe that makes it stick better. If you do not try it, well my typing it wouldn't have helped because you would have just forgotten that as well.

        So yes, I was attempting to make it clear that dorpus was asking the wrong question. That was not an accident, that was the point. And I think that what I said does help our fellow monks. Why? Because it tells them what I think is important. I believe that if they value what I think is important here, that will be helpful. It may not be the help that was requested, but I am (in case you had not noticed) someone who tries to give the help that I think does the most good, even if it is not the help that was asked for.

        Furthermore, when I first answered I gave conscious thought to the question of whether I should answer the question as posed. You see, I knew from the start that any of the three could beat the other two in practice. I sincerely thought about saying that up front, but I decided that it would obscure the critical point.

        And the critical point is that 99.9% of the time this is the wrong question to ask. The remaining 0.1% of the time, if you ask it and think carefully, you will come to the same answer that you would have come to if you had asked the right question in the first place. Therefore I thought it justified to only say what I considered to be key. Which is that until you reflexively get the syntax right and reach for the error check, it is more important to focus on those things than worrying about raw performance.

        Now if this undermines my credibility, then so be it.

        And continuing on, you may be different, but I used to claim that I never opened without a die in real code. But first I realized that if I showed my pseudo-code to others I wanted to put the die in so that they would not accidentally copy that. Then one day I caught myself missing that detail converting my own pseudo-code into production code. I then sat back, thought, and made the conscious decision to always use it, even in pseudo-code, because I didn't want to accidentally pick up and use bad habits moving to production code.

        So YMMV, but what I do in pseudo-code I tend to do in production as well. So habits I want to have in my production code I try to stick to in pseudo-code.

Re: cat vs. file handle speed?
by indigo (Scribe) on Mar 30, 2001 at 05:55 UTC
    Think about it...if the cat version was faster, why would the Perl authors bother to write a vanilla version that was slower?
