PerlMonks
Re: Re: Re: Is too little too much? Coding under the microscope...

by ChemBoy (Priest)
on Jun 28, 2001 at 10:28 UTC ( [id://92201] )


in reply to Re: Re: Is too little too much? Coding under the microscope...
in thread Is too little too much? Coding under the microscope...

You are on the right track, but missing a couple of critical steps, which leaves you throwing out data that you don't want to throw out.

I'll walk through the first version you proposed, because I like it better (I personally find that style easier to read than the other one you mentioned, and it's also closer to being right):

opendir(LOGS,".");
sort @val = map { -M $_ } grep(/.*\.log/,readdir(LOGS));

Problem one: sort returns a (sorted) list, which you're discarding. So *if* everything else were right (we'll get there), you would want

@val = sort map {-M $_} grep {...

To see the bigger problem, we break it down a bit more, working from right to left down that line:

@tmp1 = readdir LOGS;            # @tmp1 now contains the files listed in "."
@tmp2 = grep {/.*\.log/} @tmp1;  # @tmp2 now contains files that contain ".log"
@tmp3 = map {-M $_} @tmp2;       # @tmp3 now contains what?
@val  = sort @tmp3;
If you re-read map, you'll discover that the list it generates is simply the result of the block {-M $_} applied to each value in @tmp2: since -M returns the age of the file, you now have a list of the ages of the log files in the directory, but *not* their names. Sort that list, and you have a sorted list of ages, but with no names associated with them. This is not what you wanted. :-)
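A quick way to see that name-dropping in action -- here using length in place of -M, so the sketch runs without any log files on disk (the file names are invented):

```perl
use strict;
use warnings;

# invented file names; length() stands in for -M so this runs anywhere
my @files = ("a.log", "server.log", "db.log");

# map replaces each name with the block's result -- the names are gone
my @ages = map { length $_ } @files;
print "@ages\n";    # 5 10 6

# sorting that list gives sorted numbers, not sorted file names
my @sorted = sort { $a <=> $b } @ages;
print "@sorted\n";  # 5 6 10 -- but which file is which?
```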

What you *do* want is to associate age information with each of the items in @tmp1, sort on that, and get back the sorted version of @tmp1, which can be done thus:

use strict;   # you are using strict, aren't you? ;)
my @val = map  { $_->[0] }
          sort { $a->[1] <=> $b->[1] }
          map  { [$_, -M $_] }
          grep { /\.log$/ }
          readdir LOGS;
(assuming I didn't drop any critical concepts in there). You may note that I implemented chromatic's suggestion (well, almost) and tagged a $ into the regex to match the end of the file name (so "foo.logger" doesn't match).
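As a side note, here is a small sketch of what anchoring the regex buys over the original pattern (the file names are invented):

```perl
use strict;
use warnings;

my @names  = ("access.log", "foo.logger", "error.log");
my @loose  = grep { /.*\.log/ } @names;  # also matches "foo.logger"
my @strict = grep { /\.log$/ } @names;   # only names ending in ".log"

print "@loose\n";   # access.log foo.logger error.log
print "@strict\n";  # access.log error.log
```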

This is what's known as a Schwartzian Transform, a term you may or may not have encountered here before. It is very useful. It is also slightly strange to the naked eye--if it makes instant sense to you, congratulations! And if not, take some time to work it out (there are a couple of good explanations attached to that link), because it's a cool technique.



If God had meant us to fly, he would *never* have given us the railroads.
    --Michael Flanders

Replies are listed 'Best First'.
Re:{4} Is too little too much? Coding under the microscope...
by jeroenes (Priest) on Jun 28, 2001 at 14:21 UTC
    Nice post. However, this example is a bit overdone ;-}. You use Schwartzian trans-/deform logic. That's fine, as it effectively implements pope's cache. However, you can leave it out by doing the -M inside sort.

    I don't know how efficient perl's array/hash is versus the system calls. Let's try it!

    #!/usr/bin/perl -w
    use strict;
    use Benchmark;

    opendir LOGS, ".";
    my @logs = readdir LOGS;
    my $cmp = timethese( -1, {
        'Cached' => sub {
            my @sorted = map  { $_->[0] }
                         sort { $a->[1] <=> $b->[1] }
                         map  { [$_, -M $_] }
                         # grep { /\.log$/ }
                         @logs;
        },
        'Raw' => sub {
            my @sorted = sort { -M $a <=> -M $b }
                         # grep { /\.log$/ }
                         @logs;
        }
    } );
    Benchmark::cmpthese( $cmp );
    I had to comment out the greps and run it in /usr/bin just to get myself a decent number of files (1965). Run under linux RH7 (.16 kernel) / perl5.6 / ext2:
    Benchmark: running Cached, Raw, each for at least 1 CPU seconds...
        Cached:  2 wallclock secs ( 0.90 usr +  0.15 sys =   1.05 CPU) @ 18.10/s (n=19)
           Raw:  2 wallclock secs ( 0.45 usr +  0.64 sys =   1.09 CPU) @  8.26/s (n=9)
             Rate    Raw Cached
    Raw    8.26/s     --   -54%
    Cached 18.1/s   119%     --

    # that's too few iterations....
    Benchmark: timing 1000 iterations of Cached, Raw...
        Cached: 105 wallclock secs (44.91 usr +  7.74 sys =  52.65 CPU) @ 18.99/s (n=1000)
           Raw: 258 wallclock secs (47.15 usr + 78.43 sys = 125.58 CPU) @  7.96/s (n=1000)
             Rate    Raw Cached
    Raw    7.96/s     --   -58%
    Cached 19.0/s   139%     --
    As could be expected, the caching is more efficient, but the raw version is only about two times slower. All the extra time is spent in system time; the usr (perl) time is nearly the same for both.

    It might have something to do with the other jobs my box is running (quite memory/disk-demanding ones...). I will update after those are finished. Update: Doesn't matter. Same results with the system idling.

    I'm curious about other setups ... So if anyone'd like to run them....

    Cheers,

    Jeroen

Re: Re: Re: Re: Is too little too much? Coding under the microscope...
by snafu (Chaplain) on Jun 28, 2001 at 10:51 UTC
    I couldn't have asked for a clearer explanation. Honestly, I do not quite grasp the Schwartzian Transform yet. Then again, I have only heard of it for the first time tonight. I intend to read up on it tomorrow. Chromatic did 'splain a lil to me as well. Your breakdown here answers a ton of questions I had. Thank you very much for your patience and wisdom.

    ----------
    - Jim

      The quick breakdown of the Schwartzian Transform is this:

      A typical sort operation in perl takes on the order of N * log N comparisons, each of which compares two values from the array. In Perl, those values need not be known beforehand; they can be calculated on the fly inside the sort block. That calculation is then performed twice for every one of those (roughly) N * log N comparisons, so if you have an expensive calculation, you're going to have a really slow sort.
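      To watch that recomputation happen, here's a toy counter (the key sub and data are invented for illustration; length stands in for an expensive calculation):

```perl
use strict;
use warnings;

my $calls = 0;
sub key { $calls++; length $_[0] }   # stands in for an expensive calculation

my @words  = qw(banana fig kiwi ox);
my @sorted = sort { key($a) <=> key($b) } @words;

print "@sorted\n";   # ox fig kiwi banana
print "key() was called $calls times for ", scalar(@words), " elements\n";
# the key is computed twice per comparison, so $calls lands well above
# the 4 calls a Schwartzian Transform would need
```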

      What the Schwartzian Transform does is precalculate all of the values you want to sort on, building an array of arrays in which each element pairs an original item with its calculated sort value. The intermediate sort compares the precalculated values (which is fast), and the final map extracts the original items in their new order. You now calculate each element's value exactly once (N calculations in total), as opposed to ~2*N*log N times. This can be a huge speed-up.
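      The three steps, sketched on plain strings sorted by length (length stands in for the expensive calculation; the data is invented):

```perl
use strict;
use warnings;

my @words = qw(banana fig kiwi ox);

# 1. decorate: pair each element with its precomputed sort key
my @decorated = map { [ $_, length $_ ] } @words;

# 2. sort on the cached key -- it was computed only once per element
my @sorted_pairs = sort { $a->[1] <=> $b->[1] } @decorated;

# 3. undecorate: drop the keys, keep the original elements
my @sorted = map { $_->[0] } @sorted_pairs;

print "@sorted\n";   # ox fig kiwi banana
```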

      If this is still confusing, here's a good real world example: say you have your stack of monthly bills, which you keep in their original envelopes so they don't get lost. You want to sort the bills in either date or amount order. One way is to open the envelope, pull out the bill, note the value you want, put the bill back in the envelope, and work on sorting in this fashion. If you had more than 10 bills, this would become tiresome. Alternatively, for each bill, you can write the due date and amount due on the outside of the envelope (thus requiring you to look at each bill only once), and then you only need to look at the outside of the envelope to sort the bills. The latter will be much faster with a large pile of bills.


      Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
