Efficiency: Recursive functions while de-subbing my program

by Endless (Beadle)
on Jul 12, 2013 at 21:10 UTC

Endless has asked for the wisdom of the Perl Monks concerning the following question:

Dear Esteemed Monks,

In an effort to squeeze the utmost speed out of a large-scale lexical processing program I'm working on, I encountered advice that subs are slow in Perl. My program has two functions: one that recursively traverses a directory structure finding files, and one that takes those files and processes them entry by entry. In an attempt at a speed boost, I'm trying to remove those subs and make them in-line loops. So, two questions:

Question 1: is it worth it?

In the end, my directory-search routine will be called on an upper bound of around 100,000 files across several directories. Each time, it will call the second routine (processing the individual file). Would the gains be likely to be worth making this a loop instead of a sub?

Question 2: how to rewrite File::Find recursive function calls?

The sub looks like this, with the recursive line being the 'find' line:

sub _process_files {
    my $file = shift;
    ...
    # .json file
    if (-f $file and ($file =~ /\.json\Z/)) {
        _process_json($file);
    }
    # Directory
    elsif (-d $file) {
        # for each file in directory, process
        find( { wanted   => sub { \&_process_files($File::Find::name) if -f },
                no_chdir => 1 },
              "$file" );
    }
}

Replies are listed 'Best First'.
Re: Efficiency: Recursive functions while de-subbing my program
by BrowserUk (Patriarch) on Jul 12, 2013 at 21:29 UTC
    is it worth it?

    No. The time you spend hitting the disk to retrieve directory entries will completely swamp any gains from avoiding function calls.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Efficiency: Recursive functions while de-subbing my program
by zork42 (Monk) on Jul 13, 2013 at 10:35 UTC
    4 points:

    ===

    (1) I'm not sure you've structured your code correctly.

    Normally you don't use File::Find recursively.

    The typical use is:
    use File::Find;
    find(\&wanted, @directories_to_search);
    sub wanted { ... }
    i.e. you make just one call to find().
    Then find() calls the provided callback function wanted() for each file and directory it finds as it searches the provided directories.

    In your case you might have:
    use File::Find;

    find( { wanted => \&_process_files, no_chdir => 1 }, @directories_to_search );

    sub _process_files {
        my $file = $File::Find::name;
        # .json file
        if (-f $file and ($file =~ /\.json\Z/)) {
            _process_json($file);
        }
    }

    ===

    (2) Is your regexp correct?

    If you're interested in files with an extension of ".json", then it should look like this: $file =~ /\.json$/i

    ===

    (3) As everyone has said, in this case the function call time is tiny compared with the disk I/O time.

    ===

    (4) What I normally do if processing the files will take significantly more time than finding the files is this:
    1. find all the files of interest and store them in an array
    2. process each file in the array

    The advantage of this is in step 2 you can provide the user with feedback like:
    "Currently processing file 123 of 12345 files. Estimated time to finish is xxxx".

    To provide an accurate ETA, if the time to process a file is proportional to the file's size, you would need to save the size of each individual file as it is found in step 1.
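    A minimal sketch of that two-phase approach (hypothetical names: _process_json stands in for the OP's per-file work, and the directory list is taken from @ARGV):

    use strict;
    use warnings;
    use File::Find;
    use Time::HiRes qw(time);

    my @dirs = @ARGV;                      # directories to search (placeholder)
    my @files;
    my $total_bytes = 0;

    # Step 1: collect the .json files, remembering each file's size.
    find( { no_chdir => 1,
            wanted   => sub {
                return unless -f && /\.json$/i;
                my $size = -s _;
                push @files, [ $File::Find::name, $size ];
                $total_bytes += $size;
            } },
          @dirs );

    # Step 2: process the list, reporting progress and a size-weighted ETA.
    my $done_bytes = 0;
    my $start      = time;
    for my $i ( 0 .. $#files ) {
        my ( $file, $size ) = @{ $files[$i] };
        _process_json($file);              # the real per-file work
        $done_bytes += $size;
        my $elapsed = time - $start;
        my $eta = $done_bytes
                ? $elapsed * ( $total_bytes - $done_bytes ) / $done_bytes
                : 0;
        printf "Processed file %d of %d; estimated %.0f s remaining\n",
               $i + 1, scalar @files, $eta;
    }

    sub _process_json { }                  # stub so this sketch runs stand-alone

    The size-weighted ETA only makes sense if processing time really does scale with file size, as noted above; otherwise just count files.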
      if the time to process a file is proportional to the file's size, you would need to save the size of each individual file
      Only if the files are not sized similarly.
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
        That's true.

        But the general case is so easy to implement, you might as well be general :)
Re: Efficiency: Recursive functions while de-subbing my program
by Laurent_R (Canon) on Jul 13, 2013 at 09:30 UTC

    I did not say that function call overhead is peanuts in general. I said that the overhead for 100,000 calls is peanuts, and I should perhaps have added, in that specific context.

    Just a quick test:

    $ time perl -e '$e = shift;sub inc{$c=shift; $c++;return $c;} $d=0; $d=inc($d) while $d < $e; print $d;' 1e3
    1000
    real    0m0.073s
    user    0m0.030s
    sys     0m0.030s

    Laurent@Laurent-HP ~
    $ time perl -e '$e = shift;sub inc{$c=shift; $c++;return $c;} $d=0; $d=inc($d) while $d < $e; print $d;' 1e4
    10000
    real    0m0.065s
    user    0m0.046s
    sys     0m0.015s

    Laurent@Laurent-HP ~
    $ time perl -e '$e = shift;sub inc{$c=shift; $c++;return $c;} $d=0; $d=inc($d) while $d < $e; print $d;' 1e5
    100000
    real    0m0.108s
    user    0m0.046s
    sys     0m0.046s

    Laurent@Laurent-HP ~
    $ time perl -e '$e = shift;sub inc{$c=shift; $c++;return $c;} $d=0; $d=inc($d) while $d < $e; print $d;' 1e6
    1000000
    real    0m0.496s
    user    0m0.451s
    sys     0m0.030s

    Laurent@Laurent-HP ~
    $ time perl -e '$e = shift;sub inc{$c=shift; $c++;return $c;} $d=0; $d=inc($d) while $d < $e; print $d;' 1e7
    10000000
    real    0m4.320s
    user    0m4.274s
    sys     0m0.030s

    A tenth of a second for 100,000 calls on my laptop; I think that can be considered peanuts (in fact, about half of that time is really overhead for compile time, etc.; it is more like 0.04 sec. for the 100,000 calls, as shown by the runs with 1e6 and 1e7). Or less than half a microsecond per function call in this case.

    If that level of optimization is required, then maybe Perl is not the right language and changing the programming language should be considered. But the context given by the OP indicates that such a level of optimization is most probably not required.

    I am dealing daily with huge amounts of data (tens of gigabytes), and time optimization is one of my major concerns. If my program runs in, say, 5 hours, I am not interested in shortening its duration by 40 seconds through removing 100 million function calls. I am far more interested in finding a better way of doing the processing that will perhaps get that duration down to 1 hour, or possibly much less than that. And quite often, it can be done, for example by simply adding a simple hash that will enable me to filter out most of my data early in the processing.
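    For example, the kind of hash filter I mean might look like the sketch below; the file names and record layout are made up purely for illustration:

    use strict;
    use warnings;

    # Build a lookup hash of the keys we actually care about
    # (hypothetical "wanted_ids.txt", one id per line).
    my %wanted;
    open my $ids_fh, '<', 'wanted_ids.txt' or die "wanted_ids.txt: $!";
    while ( my $id = <$ids_fh> ) {
        chomp $id;
        $wanted{$id} = 1;
    }
    close $ids_fh;

    # Stream the big file and discard most records with a single hash
    # lookup, before any expensive processing happens.
    open my $data_fh, '<', 'big_data.txt' or die "big_data.txt: $!";
    while ( my $line = <$data_fh> ) {
        my ($id) = split /\t/, $line, 2;   # assume the key is the first tab-separated field
        next unless $wanted{$id};          # cheap filter: most records stop here
        # ... expensive processing only for the records that matter ...
    }
    close $data_fh;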

    I gave a talk on this subject at a Perl Mongers conference in France last month and presented some examples of just that type of optimization. My point in that talk was: why do I use (mostly) Perl if performance is one of my major goals? C would be faster, wouldn't it? Well, because I am not really interested in microsecond optimizations. I am far more interested in finding better ways of doing things and figuring out better algorithms, and Perl gives me the expressive power to do that quickly: where I need two hours to write and test a better (faster) algorithm in Perl, I might need two days or perhaps two weeks to do it in C or Java, so I would probably not even be given the budget to do it.

Re: Efficiency: Recursive functions while de-subbing my program
by Laurent_R (Canon) on Jul 12, 2013 at 22:18 UTC

    Nope. Most probably not worth it. 100,000 function calls is peanuts.

      100,000 function calls is peanuts.

      Peanuts are expensive:

      sub x{ my( $n ) = @_; $n };;

      cmpthese -1, {
          a => q[ my $n; $n += $_    for 1 .. 1e5 ],
          b => q[ my $n; $n += x($_) for 1 .. 1e5 ]
      };;

             Rate     b     a
      b    2.14/s    --  -80%
      a    10.7/s  399%    --

      That's 400% faster: five times the speed, 1/5th of the runtime, simply by avoiding 100,000 function calls.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        That's about 1/3rd of one second. 100k function calls are indeed peanuts. And peanuts are indeed cheap. Showing a peanut next to a dust mite doesn't actually make the peanut larger or more expensive.

        Doing nothing useful a whopping 400% faster is not actually a useful accomplishment. I can do nothing useful over 400,000,000% faster by actually doing nothing. That is one enormously impressive number. And it means nothing at all.

        - tye        

Re: Efficiency: Recursive functions while de-subbing my program
by mtmcc (Hermit) on Jul 13, 2013 at 05:51 UTC
    I'm with BrowserUk on this one. Also, if you sacrifice all your subroutines for small gains in speed, reusing your code would become much more painful.

    If 100,000 is your expected maximum, I doubt you would even notice speed gains if you culled your subs. But if you're not sure, there's no harm in trying it, and comparing the two cases...
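    If you do try it, Benchmark's cmpthese makes the comparison easy. A rough sketch, with a trivial calculation standing in for the real per-file work:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    sub work { my ($n) = @_; return $n * 2 }    # trivial stand-in for the real per-file work

    cmpthese( -3, {
        with_sub => sub { my $t = 0; $t += work($_) for 1 .. 100_000 },
        inlined  => sub { my $t = 0; $t += $_ * 2   for 1 .. 100_000 },
    });

    Whatever the numbers say for a synthetic body like this, the relative difference will shrink dramatically once real disk I/O is added to each iteration.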

Re: Efficiency: Recursive functions while de-subbing my program
by sundialsvc4 (Abbot) on Jul 13, 2013 at 02:08 UTC

    Either way, BrowserUk's original observation is what holds water in this case: this is an I/O-bound operation, so CPU-twiddling isn't going to translate into a material improvement. (And the comment that "Perl subs are slow" is, to me, unreliable hearsay.)

    It would be useful to time the two functions separately. How much time, and how many resources, does it actually take to run that "recursive" file-search routine? Try to predict how much time it alone would take. Could it, without your being aware of it, be taking more time or resources than it should? Then, using a list of files prepared entirely in advance, sample the amount of time that the file-processing subroutine requires, and once more predict how much time it alone would take to do the entire job. Then, does this jibe with your empirical observations of the actual combined program? The most likely thing to improve actual runtime, if it can be significantly improved at all, will be an algorithm revision of some kind.
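    One rough way to sample the two phases separately, sketched with Time::HiRes (the directory name is a placeholder, and _process_json is stubbed so the sketch runs stand-alone):

    use strict;
    use warnings;
    use File::Find;
    use Time::HiRes qw(time);

    # Phase 1: just find the files, and time that by itself.
    my $t0 = time;
    my @files;
    find( { no_chdir => 1,
            wanted   => sub { push @files, $File::Find::name if -f && /\.json$/i } },
          'data' );                               # placeholder directory
    printf "find phase:       %.2f s (%d files)\n", time - $t0, scalar @files;

    # Phase 2: process the pre-built list, and time that by itself.
    my $t1 = time;
    _process_json($_) for @files;
    printf "processing phase: %.2f s\n", time - $t1;

    sub _process_json { }                         # stub standing in for the real work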

    Don’t “diddle” code to make it faster:   find a better algorithm.
    Kernighan & Plauger:   The Elements of Programming Style
