http://www.perlmonks.org?node_id=134485


in reply to Why study SCALAR?

I considered using study() in a project at work, but could not find a sufficient gain in efficiency to justify it (the application I considered it for was a CGI searching for a limited amount of information). I did, however, test study() again out of curiosity after reading the replies by clintp and LunaticLeo.

My testing consisted of performing a search for the word "lease" in a large file (a sample taken from a DHCP server's leases file, consisting of 787'811 lines / 22'741'219 characters, the word occurring 69'474 times) using the code below. I wrote the results from the program to STDERR (to be able to filter them later), and tested 3 possibilities:

  1. without the use of study()
  2. using study() before the loop, and
  3. using study() within the loop, similar to the way used in the 2nd edition of the Camel book.
My results (executed using 'perl test.pl 2>/dev/null') were as follows:
    Benchmark: timing 100 iterations of w/o study, with study, with study in loop...
    w/o study: 346 wallclock secs (306.95 usr + 8.69 sys = 315.64 CPU)
    with study: 317 wallclock secs (301.91 usr + 8.53 sys = 310.44 CPU)
    with study in loop: 369 wallclock secs (347.37 usr + 8.39 sys = 355.76 CPU)

    #!/usr/local/bin/perl -w --
    use Benchmark qw(timethese clearallcache);

    $FILENAME = "datafile.txt";
    $TEXT = "lease";
    $STUDIED_TEXT = $TEXT;
    study($STUDIED_TEXT);
    $COUNT = 100;

    clearallcache;

    &timethese($COUNT, {
        'with study'         => \&fn1,
        'w/o study'          => \&fn2,
        'with study in loop' => \&fn3
    } );

    sub fn1 {
        &mystat($FILENAME);
        print(STDERR "Searching for $STUDIED_TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$STUDIED_TEXT/);
        }
        print(STDERR "fn1 : Lines found : $count\n");
        close(DF);
    }

    sub fn2 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn2 : Lines found : $count\n");
        close(DF);
    }

    sub fn3 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($TEXT);
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn3 : Lines found : $count\n");
        close(DF);
    }

    sub mystat {
        local($filename) = @_;
        print(STDERR "Filename : $filename\tSize : ", (stat($filename))[7], "\t");
    }

My results might well have differed had my search string contained some characters that are rarer than others, and I am still learning to benchmark effectively. The moral (I believe) is that if you think study() might prove helpful, benchmark it and see, and remember, as always, YMMV.

Update: I stand corrected by the experience and knowledge of chipmunk. Thank you, chipmunk, for the correction to my understanding (or lack thereof).

Update: After considering chipmunk's correction, I have edited and retested the code to try to determine the effect of the study() statement. The new code is below, but I have left the code above as it was for those who may learn from the correction, as I have. I used the same datafile as before. The new tests were:

  1. without use of study() or /o (on regex)
  2. without use of study() but with /o
  3. with study() without /o, and
  4. with study() and /o.
The results were as follows:
    Benchmark: timing 100 iterations of w/o study or /o, w/o study with /o, with study and /o, with study w/o /o...
    w/o study or /o: 352 wallclock secs (304.41 usr + 8.55 sys = 312.96 CPU)
    w/o study with /o: 388 wallclock secs (253.90 usr + 8.33 sys = 262.23 CPU)
    with study and /o: 881 wallclock secs (507.50 usr + 8.17 sys = 515.67 CPU)
    with study w/o /o: 823 wallclock secs (597.40 usr + 8.31 sys = 605.71 CPU)

    #!/usr/local/bin/perl -w --
    use Benchmark qw(timethese clearallcache);

    $FILENAME = "datafile.txt";
    $TEXT = "lease";
    $COUNT = 100;

    clearallcache;

    &timethese($COUNT, {
        'w/o study or /o'   => \&fn1,
        'w/o study with /o' => \&fn2,
        'with study w/o /o' => \&fn3,
        'with study and /o' => \&fn4
    } );

    sub fn1 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn1 : Lines found : $count\n");
        close(DF);
    }

    sub fn2 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/o);
        }
        print(STDERR "fn2 : Lines found : $count\n");
        close(DF);
    }

    sub fn3 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($line);
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn3 : Lines found : $count\n");
        close(DF);
    }

    sub fn4 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($line);
            $count++ if ($line =~ m/$TEXT/o);
        }
        print(STDERR "fn4 : Lines found : $count\n");
        close(DF);
    }

    sub mystat {
        local($filename) = @_;
        print(STDERR "Filename : $filename\tSize : ", (stat($filename))[7], "\t");
    }
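
The /o variants are only safe here because $TEXT never changes during the run. The following is a minimal sketch of what /o does to an interpolated pattern, not part of the benchmark itself. The helper names and the sample line are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: because of /o, $pattern is interpolated and compiled
# the first time this match is executed; later values of $pattern are ignored.
sub matches_once_compiled {
    my ($line, $pattern) = @_;
    return $line =~ m/$pattern/o ? 1 : 0;
}

# Hypothetical helper: the caller precompiles the pattern with qr//,
# so nothing is frozen inside the sub.
sub matches_precompiled {
    my ($line, $regex) = @_;
    return $line =~ $regex ? 1 : 0;
}

my $line = "lease 192.168.0.10 {";    # made-up sample line

print matches_once_compiled($line, "lease"),   "\n";  # compiles "lease" here
print matches_once_compiled($line, "no-such"), "\n";  # still tests against "lease"

print matches_precompiled($line, qr/lease/),   "\n";
print matches_precompiled($line, qr/no-such/), "\n";
```

The second call should still report a match, because the pattern compiled on the first call is reused. If the search string can change between calls, qr// is the usual way to precompile a pattern without that surprise.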

Question: what effect could the caching in the Benchmark.pm module have on this code/results?
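
One way to probe this question, as a sketch under assumptions: going by the Benchmark documentation, the cache holds the timing of the empty ("null") loop for a given iteration count, which is subtracted from each measurement. Timing the same code with the cache enabled and disabled should show whether it matters. The in-memory @lines array and the helper names below are hypothetical stand-ins for the real leases file.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr clearallcache disablecache enablecache);

# Hypothetical in-memory stand-in for the leases file, so the sketch runs anywhere.
my @lines = map { "lease 10.0.0.$_ {\n" } 1 .. 1000;

# Count the lines that contain a literal word.
sub count_matches {
    my ($word) = @_;
    my $n = grep { /\Q$word\E/ } @lines;
    return $n;
}

# Time $count runs of $code with the null-loop cache switched off and emptied,
# so the empty-loop overhead is measured afresh for this call.
sub time_without_cache {
    my ($count, $code) = @_;
    disablecache();
    clearallcache();
    my $t = timeit($count, $code);
    enablecache();
    return $t;
}

my $cached   = timeit(20, sub { count_matches("lease") });
my $uncached = time_without_cache(20, sub { count_matches("lease") });

print "cached:   ", timestr($cached),   "\n";
print "uncached: ", timestr($uncached), "\n";
```

If the two timings are close, the cache is probably not distorting the comparison; with only 100 iterations of a long-running sub, the null-loop overhead is likely to be tiny next to the search itself.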

Re: Re: Why study SCALAR?
by chipmunk (Parson) on Dec 27, 2001 at 02:52 UTC
    study is meant to be used on the target string, not the regular expression!
    study($line); $line =~ /$TEXT/;
    Of course, study won't be a win if you're only going to perform a single match on the target string. And it turns out that it probably won't be a win even if you do a bunch of matches on the target string. The regular expression engine has had lots of optimizations added to it over time, making it pretty fast with or without the use of study.
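
The usage chipmunk describes can be sketched as follows: study() is applied once to the target string, and several matches are then run against that same string. The sample data, the word list, and the sub name are hypothetical. Note also that perlfunc documents study() as doing nothing from Perl 5.16 onward, so any timing difference would only show up on older perls.

```perl
use strict;
use warnings;

# Hypothetical sample data: one target string built from 50 lease-like records.
my $target = join '', map { "lease 10.0.0.$_ {\n  starts 2001/12/27;\n}\n" } 1 .. 50;

# Count occurrences of several literal words in one target string.
# study() is applied to the target (not to the patterns), once, before matching.
sub count_words {
    my ($text, $use_study, @words) = @_;
    study($text) if $use_study;
    my %hits;
    for my $word (@words) {
        my $n = () = $text =~ /\Q$word\E/g;   # count all matches of this word
        $hits{$word} = $n;
    }
    return \%hits;
}

my @words   = qw(lease starts ends);
my $plain   = count_words($target, 0, @words);
my $studied = count_words($target, 1, @words);

printf "%-6s plain=%d studied=%d\n", $_, $plain->{$_}, $studied->{$_} for @words;
```

The counts should be identical with and without study(), since it only ever affected speed. To compare timings, each call could be wrapped in Benchmark's timethese, as in the scripts above.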