PerlMonks  

Why study SCALAR?

by mrbbking (Hermit)
on Dec 26, 2001 at 06:35 UTC ( #134337=perlquestion )

mrbbking has asked for the wisdom of the Perl Monks concerning the following question:

I'm looking for real-life uses of Perl's study function.

The discussion of study here at the Monastery is largely the same as what's in the Camel Book, and the description leaves me wondering when and why one might use it.

Before I get too long-winded, do you use study? What for, and why?

Replies are listed 'Best First'.
Re: Why study a SCALAR?
by clintp (Curate) on Dec 26, 2001 at 08:59 UTC
    To be perfectly honest, study was one of the very few places in the Perl Developer's Dictionary where I copped out and didn't include an example of the syntax in an actual program. Why?

    Following the guidelines in the Camel, online docs, and what I could grok of the source code, I kept coming up with seriously contrived uses. Okay, I did find a few uses that seemed to apply and didn't look so bad. So like any "optimization" scheme, I set up some benchmarks so that I could recommend it or at least further specify *exactly* where study benefits.

    By the time I had a broad list of examples where study was beneficial, and a list where it didn't help (or even hurt) performance, it would have taken several pages to explain *why* it works this way, all while dancing around implementation details of the language that I really didn't care to explain. Very, very small changes in the input data would cause large swings in the benchmark timings. I didn't want a huge checklist of cases and exceptions with disclaimers making the whole thing moot anyway.

    So I documented what I found (which is essentially what the Camel 3ed and the docs say) but with even broader warnings and more vigorous handwaving.

      One area where I'm currently using regexes is a 'simulator' that I've written in Perl, which basically interprets another language (a process control language). The syntax is different, this other language allows for arbitrarily complex expressions (really hairy 2-page messes with plenty of parentheses nesting, etc... nothing I'd want to maintain, and I'm glad I don't have to), and it also provides for an IF statement which tests whether an expression's value has gone from 0 to 1 (an edge-triggered device, to a hardware person). So it's not a trivial one-for-one translation. Bear in mind that I'm using some of the regexes to *alter* the original line; in essence, I translate it into the Perl equivalent and then use eval() to 'execute' it. Is this a candidate for the use of study($line), given that the $line is changing along the way? (If so, I will attempt to see if there's a speedup; right now, it's executing tens of thousands of lines in a little over a minute and I've got timestamping which could tell if there's anything to be gained.)
Re: Why study SCALAR?
by LunaticLeo (Scribe) on Dec 26, 2001 at 22:20 UTC
    study() is intended to be used to help optimize regular expressions on the scalar.

    I have done benchmarks repeating the same regex on the scalar, and running multiple regexes on the same scalar. I have never found a speedup.

    BTW, my benchmarks were like:

    use Benchmark qw(&cmpthese);
    $STUDIED_TEXT = $TEXT;
    study $STUDIED_TEXT;
    cmpthese($COUNT, {
        'with study' => \&fn1,
        'w/o study' => \&fn2
    } );
    Basically, study() is an anachronism. Feel free to ignore it, everybody else does.
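
    The benchmark outline above leaves $TEXT, $COUNT, fn1 and fn2 undefined. A minimal sketch of how it might be filled in follows; the sample data, the iteration count, the pattern and the helper count_matches are all assumptions made for illustration, not LunaticLeo's actual test. Note also that perlfunc documents study as a no-op since Perl 5.16, so on a modern perl the two rows should come out essentially identical.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample data: 2000 short records joined into one string.
my $TEXT = join ' ', map { "lease $_ starts 2001/12/26;" } 1 .. 2000;

# Separate copy, so that only one of the two scalars is ever studied.
my $STUDIED_TEXT = $TEXT;
study $STUDIED_TEXT;

my $COUNT = 200;    # iterations per variant (assumed value)

# Count the matches of a fixed pattern in the referenced string.
sub count_matches {
    my ($string_ref) = @_;
    my $n = () = $$string_ref =~ /starts/g;
    return $n;
}

sub fn1 { count_matches(\$STUDIED_TEXT) }    # target string was studied
sub fn2 { count_matches(\$TEXT) }            # target string was not studied

cmpthese($COUNT, {
    'with study' => \&fn1,
    'w/o study'  => \&fn2,
});
```

    Passing a reference keeps both subs working on the original scalars rather than on fresh copies, which matters because study applies to a particular scalar and not to its contents.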
Re: Why study SCALAR?
by atcroft (Abbot) on Dec 27, 2001 at 01:14 UTC

    I considered the use of study() in a project at work, but was unable to find a sufficient increase in efficiency to use it (as the application I considered it for was a CGI searching for a limited amount of information). I did, however, test the use of study() again out of curiosity after reading the replies by clintp and LunaticLeo.

    My testing consisted of performing a search for the word "lease" in a large file (a sample taken from a DHCP server's leases file, consisting of 787'811 lines / 22'741'219 characters, the word occurring 69'474 times) using the code below. I wrote the results from the program to STDERR (to be able to filter them later), and tested 3 possibilities:

    1. without the use of study()
    2. using study() before the loop, and
    3. using study() within the loop, similar to the way used in the 2nd edition of the Camel book.
    My results (executed using 'perl test.pl 2>/dev/null') were as follows:
    Benchmark: timing 100 iterations of w/o study, with study, with study in loop...
             w/o study: 346 wallclock secs (306.95 usr + 8.69 sys = 315.64 CPU)
            with study: 317 wallclock secs (301.91 usr + 8.53 sys = 310.44 CPU)
    with study in loop: 369 wallclock secs (347.37 usr + 8.39 sys = 355.76 CPU)

    #!/usr/local/bin/perl -w --
    use Benchmark qw(timethese clearallcache);

    $FILENAME = "datafile.txt";
    $TEXT = "lease";
    $STUDIED_TEXT = $TEXT;
    study($STUDIED_TEXT);
    $COUNT = 100;

    clearallcache;
    &timethese($COUNT, {
        'with study' => \&fn1,
        'w/o study' => \&fn2,
        'with study in loop' => \&fn3
    } );

    sub fn1 {
        &mystat($FILENAME);
        print(STDERR "Searching for $STUDIED_TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$STUDIED_TEXT/);
        }
        print(STDERR "fn1 : Lines found : $count\n");
        close(DF);
    }

    sub fn2 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn2 : Lines found : $count\n");
        close(DF);
    }

    sub fn3 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($TEXT);
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn3 : Lines found : $count\n");
        close(DF);
    }

    sub mystat {
        local($filename) = @_;
        print(STDERR "Filename : $filename\tSize : ", (stat($filename))[7], "\t");
    }

    My results might well have differed had I used a search string containing some characters rarer than others, and I am still learning to Benchmark effectively. The moral to this (I believe) is that if you think it might prove helpful, Benchmark it and see, and remember, as always, YMMV.

    Update: I stand corrected by the experience and knowledge of chipmunk. Thank you, chipmunk, for the correction to my understanding (or lack thereof).

    Update: After considering chipmunk's correction, I have edited and retested the code to try to determine the effect of the study() statement. The new code is below, but I have left the code above as text for those who may learn from the correction, as I have. I used the same datafile as before. The new tests were:

    1. without use of study() or /o (on regex)
    2. without use of study() but with /o
    3. with study() without /o, and
    4. with study() and /o.
    The results were as follows:
    Benchmark: timing 100 iterations of w/o study or /o, w/o study with /o, with study and /o, with study w/o /o...
      w/o study or /o: 352 wallclock secs (304.41 usr + 8.55 sys = 312.96 CPU)
    w/o study with /o: 388 wallclock secs (253.90 usr + 8.33 sys = 262.23 CPU)
    with study and /o: 881 wallclock secs (507.50 usr + 8.17 sys = 515.67 CPU)
    with study w/o /o: 823 wallclock secs (597.40 usr + 8.31 sys = 605.71 CPU)

    #!/usr/local/bin/perl -w --
    use Benchmark qw(timethese clearallcache);

    $FILENAME = "datafile.txt";
    $TEXT = "lease";
    $COUNT = 100;

    clearallcache;
    &timethese($COUNT, {
        'w/o study or /o' => \&fn1,
        'w/o study with /o' => \&fn2,
        'with study w/o /o' => \&fn3,
        'with study and /o' => \&fn4
    } );

    sub fn1 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn1 : Lines found : $count\n");
        close(DF);
    }

    sub fn2 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            $count++ if ($line =~ m/$TEXT/o);
        }
        print(STDERR "fn2 : Lines found : $count\n");
        close(DF);
    }

    sub fn3 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($line);
            $count++ if ($line =~ m/$TEXT/);
        }
        print(STDERR "fn3 : Lines found : $count\n");
        close(DF);
    }

    sub fn4 {
        &mystat($FILENAME);
        print(STDERR "Searching for $TEXT\t");
        open(DF, $FILENAME);
        my $count = 0;
        while ($line = <DF>) {
            study($line);
            $count++ if ($line =~ m/$TEXT/o);
        }
        print(STDERR "fn4 : Lines found : $count\n");
        close(DF);
    }

    sub mystat {
        local($filename) = @_;
        print(STDERR "Filename : $filename\tSize : ", (stat($filename))[7], "\t");
    }

    Question: what effect could the caching in the Benchmark.pm module have on this code/results?
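
    One way to probe the caching question empirically: as far as the Benchmark documentation describes it, the cache holds only the timing of the empty ("null") loop for a given iteration count, not any result of the code under test. The sketch below runs the same tests with the cache enabled and then disabled so the two reports can be compared. The sample line, the patterns and the iteration count are assumptions for illustration, not part of the tests above.

```perl
use strict;
use warnings;
use Benchmark qw(timethese clearallcache disablecache enablecache);

my $COUNT = 50_000;                               # assumed iteration count
my $line  = 'lease 10.0.0.7 starts 2001/12/26;';  # hypothetical sample line

my %tests = (
    'match'    => sub { my $hit = $line =~ /lease/ },
    'no match' => sub { my $hit = $line =~ /absent/ },
);

# First pass: start from an empty cache, caching left enabled (the default).
clearallcache();
my $cached = timethese($COUNT, \%tests);

# Second pass: caching disabled, so the null-loop overhead is measured
# again for every piece of code that gets timed.
disablecache();
my $uncached = timethese($COUNT, \%tests);
enablecache();
```

    If the two reports agree to within normal run-to-run noise, the cache is probably not what is shaping the numbers.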

      study is meant to be used on the target string, not the regular expression!
      study($line); $line =~ /$TEXT/;
      Of course, study won't be a win if you're only going to perform a single match on the target string. And it turns out that it probably won't be a win even if you do a bunch of matches on the target string. The regular expression engine has had lots of optimizations added to it over time, making it pretty fast with or without the use of study.
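
      To illustrate the point about the target string, here is a minimal sketch of the one situation study was designed for: several different patterns matched against the same unmodified string. The pattern list and the helper count_hits are hypothetical, and whether the study call buys anything at all should be measured rather than assumed.

```perl
use strict;
use warnings;

# Hypothetical patterns, all applied to the same unmodified target string.
my @patterns = (qr/lease/, qr/starts/, qr/ends/, qr/hardware ethernet/);

# Return how many of the patterns match the given line.
sub count_hits {
    my ($line) = @_;
    study $line;    # study the target string once, before the matches
    my $hits = 0;
    for my $re (@patterns) {
        $hits++ if $line =~ $re;
    }
    return $hits;
}

print count_hits('lease 10.0.0.7 { starts 2001/12/26; }'), "\n";    # prints 2
```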
