Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:
All:
I have spent the last 3 weeks converting a suite of shell scripts to Perl. The purpose of this can be found here, although two of my initial requirements changed after a very long, hard look at the transient files I was checking against.
Only the first 64K of each file needs to be read: if the string I am looking for is not in that first chunk, it doesn't matter whether it appears elsewhere in the file.
Removing embedded newlines was not a real requirement, since a match in the first 64K will never have the newline problem.
The following is the final code:
#!/usr/bin/perl -w
use strict;
use Time::Local;

chdir "/var/spool/wt400/gateways/" . $ARGV[0];
mkdir "capture", 0755 unless (-d "capture");

my $ListTime = 0;
my %Traps;
my @Files;
my $Counter = 1;
my $Size;
my $Now;
my $NF;
my $Matcher;
my $Match_code;

open (LOG, ">>/disk4/Logs/traps/" . $ARGV[0] . "_" . $ARGV[1] . ".log");
flock(LOG, (2 | 4)) or exit;    # LOCK_EX | LOCK_NB: bail out if another copy holds the log
select LOG;

while (1) {
    # Re-read the traplist on the first pass and every 20th pass thereafter
    if ($Counter > 20 || ! %Traps) {
        if ( (stat("traplist." . $ARGV[1]))[9] > $ListTime ) {    # numeric mtime compare (was 'gt')
            $ListTime = (stat(_))[9];
            %Traps = ();
            open (LIST, "traplist." . $ARGV[1]);
            while (<LIST>) {
                next if (/^#/ || /^Created\t\tExpires/ || /^\s*$/);
                my @Fields = split "\t", $_;
                next unless (@Fields == 8);
                chomp $Fields[7];
                my ($mon, $day, $year, $hour, $min) = split /[-\/:]/, $Fields[1];
                my $Expiration = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                $Traps{$Fields[7]} = [ $Expiration, @Fields[2, 5, 6] ];
            }
            close (LIST);
        }
        $Counter = 1;
    }

    # Rebuild the matcher sub from the unexpired traps
    $Now        = time;
    $Match_code = "";
    $Size       = 0;
    foreach my $Trap (keys %Traps) {
        unless ($Traps{$Trap}[0] < $Now && $Traps{$Trap}[1]) {
            if ($Traps{$Trap}[3] eq "SIZE") {
                $Size = $Traps{$Trap}[2] if ($Traps{$Trap}[2] > 0);
            }
            else {
                $Trap =~ s/(\W)/\\$1/g;
                $Trap = "(?i-xsm)" . $Trap;
                $Match_code .= "return \"$Trap\" if \$_[0] =~ /$Trap/;";
            }
        }
    }
    exit unless ($Match_code || $Size);
    $Matcher = eval "sub {" . $Match_code . "}";

    if ($ARGV[1] eq "out") {
        @Files = <out/do*>;
    }
    elsif ($ARGV[1] eq "in") {
        @Files = <in/di*>;
    }
    else {
        @Files = <out/do* in/di*>;
    }
    matchfile(\@Files);
    $Counter++;
    sleep 3;
}

sub matchfile {
    local ($/) = \65536;    # read in 64K chunks; only the first chunk is examined
    FILE:
    while (my $File = shift @{$_[0]}) {
        if ($Size && -s $File >= $Size) {
            ($NF = $File) =~ s/^.*\///;
            rename $File, "capture/" . $NF . "-SIZE";
            print time . " " . $NF . " " . (stat(_))[7] . " SIZE\n";
            next FILE;
        }
        unless (open(FILE, $File)) {
            next FILE;
        }
        while (<FILE>) {
            my $Match = $Matcher->($_);
            if ($Match) {
                $Match =~ s/\(\?i-xsm\)//;
                $Match =~ s/\\(\W)/$1/g;    # undo the \W escaping so the %Traps lookup uses the raw trap string
                ($NF = $File) =~ s/^.*\///;
                rename $File, "capture/" . $NF . "-" . $Traps{$Match}[3];
                print time . " " . $NF . " " . (stat(_))[7] . " " . $Traps{$Match}[3] . "\n";
            }
            next FILE;    # stop after the first 64K chunk
        }
    }
}
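The matcher above is compiled from a string with eval, so each chunk of a file is tested with a single sub call. A minimal, standalone sketch of that technique, using invented patterns and quotemeta in place of the script's s/(\W)/\\$1/g:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Standalone sketch of the string-eval matcher technique; the patterns
# here are invented.  quotemeta performs the same \W escaping the real
# script does by hand.
my @patterns = ('needle', 'big haystack');

my $code = "";
for my $i (0 .. $#patterns) {
    my $quoted = quotemeta $patterns[$i];
    # Each pattern becomes one "return ... if ..." statement in the sub
    $code .= "return $i if \$_[0] =~ /$quoted/;";
}
$code .= "return -1;";    # sentinel: no pattern matched

my $matcher = eval "sub { $code }";
die "generated code failed to compile: $@" unless $matcher;

print "matched pattern #", $matcher->("a needle in this text"), "\n";
```

The payoff is that the pattern list is walked once when the traplist changes, not once per line scanned.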
The traplist file that the data is read from is tab-separated and looks like:
Created Expires Use Type Author Size Name Trap
07:36:56-07:36 07:36:56-07:36 1 0 XYZ 98765 SIZE N/A
07:36:56-07:36 07:36:56-07:36 1 0 XYZ N/A TRAP1 cool things to look for
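To make the mapping from traplist line to %Traps concrete, here is a hedged, standalone sketch of the parsing step. The Expires value 01/27/03-22:15 (mon/day/yy-hour:min) is an assumed format for illustration, as are the sample values:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local;

# One invented traplist line; "01/27/03-22:15" is an assumed Expires format.
my $line = join("\t",
    '07:36:56-07:36',    # Created (the script ignores this field)
    '01/27/03-22:15',    # Expires: mon/day/yy-hour:min (assumption)
    '1', '0', 'XYZ', 'N/A', 'TRAP1',
    "cool things to look for\n");

my %Traps;
my @Fields = split "\t", $line;
if (@Fields == 8) {
    chomp $Fields[7];
    my ($mon, $day, $year, $hour, $min) = split /[-\/:]/, $Fields[1];
    my $Expiration = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
    # Keyed by the raw trap string; value is [expiry, use, size, name]
    $Traps{$Fields[7]} = [ $Expiration, @Fields[2, 5, 6] ];
}
print "stored trap for: ", join(", ", keys %Traps), "\n";
```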
The first argument names the directory to look for the traplist file in, and, together with the second argument, the base directory to work from.
The second argument supplies the rest of the traplist filename and selects the working subdirectory.
If arg1 = blah, the script looks for the traplist file in /var/spool/wt400/gateways/blah.
If arg2 = out, it opens /var/spool/wt400/gateways/blah/traplist.out and does its work in /var/spool/wt400/gateways/blah/out.
OK, so without further ado, here is my problem:
I need to have about 20 copies of the exact same script running, where the only difference is the two arguments passed to each, because there is a race condition beyond my control. Now I am using far more memory than the shell scripts ever did. I compared:
ps -el | grep <shell> - sz = 50
ps -el | grep <perl> - sz > 300
I know where the gap is coming from, and I could accept the difference in exchange for everything else I gained if it were only one copy, but that difference gets multiplied by every copy running (about 20).
The only thing that comes to mind is threads, but I have heard such conflicting information about them that I didn't even consider them when I started the port.
Do I have to abandon my code, or is there a way to take advantage of my multi-processor, high-end server and have one (or maybe two or three) processes handle all the directories?
Thanks in advance - L~R
Re: 3 weeks wasted? - will threads help?
by perrin (Chancellor) on Jan 27, 2003 at 22:11 UTC
I really don't understand your statement about needing to run 20 copies of the script because of a race condition, but if you want to save memory, you should try starting one script and forking rather than starting 20 separate copies of Perl. The copy-on-write memory handling in modern OSes will save you quite a bit of RAM.
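A minimal sketch of that approach (the directory names and job list are invented, and the child's work is elided): one interpreter starts, then forks one worker per directory, so on a copy-on-write OS the children share the interpreter's unmodified pages with the parent.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Start Perl once, fork one child per gateway directory.  The job list
# here is hypothetical; the real script would build it from its config.
my @jobs = map { [ "gw$_", 'out' ] } 1 .. 3;    # [directory, in|out] pairs

my @pids;
for my $job (@jobs) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        my ($dir, $side) = @$job;
        # child: chdir "/var/spool/wt400/gateways/$dir" and run the scan loop
        exit 0;
    }
    push @pids, $pid;    # parent: remember the child
}
waitpid($_, 0) for @pids;    # reap (a real supervisor might restart dead children)
print "forked ", scalar(@pids), " workers\n";
```

Because the children are forked after Perl has compiled the script, the interpreter and compiled code pages are shared until a child writes to them.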
Fair enough...
In my readmore tags I explained that each copy works on a different directory; each directory is very transient, and that is the race condition beyond my control.
The extra memory overhead is coming from the Perl interpreter, not the code itself (or at least that is my belief); see below:
#!/usr/bin/perl -w
use strict;

while (1) {
    print "I am only printing and sleeping\n";
    sleep 1;
}
The above code shows up in ps -el with almost the same sz as the code in my readmore tags.
As I understand it, forking will not buy me anything, since I would be making an exact duplicate (memory and all). I thought threads might help, but as I understand them, each thread gets its own copy of the interpreter, so there are no memory savings there either.
So my question, stated more clearly, is:
Given a piece of code to parse a single directory, how can I parse multiple directories concurrently (or very nearly so) without the memory overhead of each instance requiring its own interpreter?
Concatenating the files in each directory into one long list isn't feasible either.
I freely admit that I may be asking to get something for nothing, but it seems like an awful waste not to be able to use the Perl code and to have to continue using the shell scripts :-(
Cheers - L~R
Actually, forking will buy you a lot. You are not understanding what I said about copy-on-write. On a modern OS, when you fork, the memory is not actually copied: only pages that get changed are copied, and the rest is shared between processes. It makes a huge difference, and it is widely used with mod_perl to reduce the memory overhead of multiple Perl interpreters.
Re: 3 weeks wasted? - will threads help?
by BrowserUk (Patriarch) on Jan 27, 2003 at 23:18 UTC
Looking at your code, you appear to be processing one, the other, or both of the directories in/ and out/ relative to the path "/var/spool/wt400/gateways/" . $ARGV[0].
Presumably each of your 20 copies of the script is processing a different subtree of /var/spool/wt400/gateways/?
In that case, you could do your initial chdir to /var/spool/wt400/gateways/ and do your globbing as <*/in/*> etc., processing the files from all 20 subdirs in one loop. I notice that you have a sleep 3 in your main loop, which probably means you're not utilizing much of the CPU as it stands, so you should easily have enough processor to cope with the 20 dirs in the main loop. You might need to change that sleep to
sleep 3 - $time_spent_last_pass;
I realise that the traplist file is different for each subtree, but the <*/in/*> form of the glob returns filenames in the form subdir/in/file, so you can split on the /'s, extract the subdir, and use it as the first key of your %traps hash to select the appropriate set of trap information for the file.
It means reworking your code somewhat, but probably less work than moving to either threads or fork.
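A minimal sketch of that subdir-keyed lookup, where the filenames stand in for the result of a <*/in/di*> glob and the trap-set names are invented:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Invented filenames standing in for a <*/in/di*> glob result.
my @globbed = qw( blah/in/di001 other/in/di002 );

# Two-level traps hash: first key is the subdir, second the in/out side.
# The trap-set names are placeholders for the real per-subtree trap data.
my %traps = (
    blah  => { in => 'trapset-A' },
    other => { in => 'trapset-B' },
);

my @picked;
for my $path (@globbed) {
    # subdir/in/file -> three parts; the subdir selects the trap set
    my ($subdir, $side, $file) = split m{/}, $path;
    push @picked, "$file uses $traps{$subdir}{$side}";
}
print "$_\n" for @picked;
```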
Examine what is said, not who speaks.
The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.
It's just not possible, BrowserUk.
The transient files are created with a naming syntax such that the glob automatically lists them oldest to newest. It can take up to two seconds to process a single directory, but once I am done, I have a fair amount of assurance (I will change the sleep statement if it is too much) that I can wait 3 seconds before parsing that directory again. I can't wait over half a minute, which is what could happen if I parsed all the directories in one pass.
I really need to parse each directory as if it were the only one.
Cheers - L~R
Why not glob, then fork and process the globbed data in the child, while sleeping 2 seconds in the parent and looping all over again? Also, to answer your question about forking above: check your system's man page for fork(2). I am pretty sure HP-UX has used copy-on-write (only changed memory pages are actually copied) since 10.x.
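A rough sketch of that loop (the filenames are invented and the child's scan is elided): the parent snapshots the directory, hands the snapshot to a forked child, and paces itself with a sleep before the next pass.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @reaped;
for my $pass (1 .. 2) {    # the real loop would run forever
    # Stands in for a glob of the transient directory at this instant
    my @snapshot = ("di00${pass}a", "di00${pass}b");

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        # child: scan this pass's snapshot while the parent keeps time
        exit 0;
    }

    sleep 1;               # parent paces the next pass (2 seconds in the post)
    waitpid($pid, 0);      # reap before looping (or handle SIGCHLD instead)
    push @reaped, $pid;
}
print "reaped ", scalar(@reaped), " children\n";
```

Reaping with waitpid here keeps the sketch simple; a long-running parent would more likely reap asynchronously so a slow child never blocks the glob.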
-Waswas
(jptxs) Re: 3 weeks wasted? - will threads help?
by jptxs (Curate) on Jan 28, 2003 at 01:12 UTC
Re: 3 weeks wasted? - will threads help?
by busunsl (Vicar) on Jan 28, 2003 at 13:45 UTC
Re: 3 weeks wasted? - will threads help?
by castaway (Parson) on Jan 28, 2003 at 13:00 UTC