
3 weeks wasted? - will threads help?

by Limbic~Region (Chancellor)
on Jan 27, 2003 at 21:00 UTC ( #230355=perlquestion )

Limbic~Region has asked for the wisdom of the Perl Monks concerning the following question:

I have spent the last 3 weeks converting a suite of shell scripts to Perl. The purpose of this can be found here, although two of my initial requirements changed after a very long, hard look at the transient files I was checking against.

  • Only the first 64K of the file needs to be read - if the string I am looking for is not there, it doesn't matter if it is elsewhere in the file.
  • Removing embedded newlines wasn't a real requirement, since a match in the first 64K will never have the newline problem.

    The following is the final code

    #!/usr/bin/perl -w
    use strict;
    use Time::Local;

    chdir "/var/spool/wt400/gateways/" . $ARGV[0];
    mkdir "capture", 0755 unless (-d "capture");

    my $ListTime = 0;
    my %Traps;
    my @Files;
    my $Counter = 1;
    my $Size;
    my $Now;
    my $NF;
    my $Matcher;
    my $Match_code;

    open (LOG, ">>/disk4/Logs/traps/" . $ARGV[0] . "_" . $ARGV[1] . ".log");
    flock(LOG,(2|4)) or exit;
    select LOG;

    while (1) {
        if ($Counter > 20 || ! %Traps) {
            if ( (stat("traplist." . $ARGV[1]))[9] gt $ListTime ) {
                $ListTime = (stat(_))[9];
                %Traps = ();
                open (LIST,"traplist." . $ARGV[1]);
                while (<LIST>) {
                    next if (/^#/ || /^Created\t\tExpires/ || /^\s*$/);
                    my @Fields = split "\t" , $_;
                    next unless (@Fields == 8);
                    chomp $Fields[7];
                    my($mon, $day, $year, $hour, $min) = split ?[-/:]? , $Fields[1];
                    my $Expiration = timelocal(0, $min, $hour, $day, $mon - 1, $year + 100);
                    $Traps{$Fields[7]} = [ $Expiration,@Fields[2,5,6] ];
                }
                close (LIST);
            }
            $Counter = 1;
        }
        $Now = time;
        $Match_code = "";
        $Size = 0;
        foreach my $Trap (keys %Traps) {
            unless ($Traps{$Trap}[0] < $Now && $Traps{$Trap}[1]) {
                if ($Traps{$Trap}[3] eq "SIZE") {
                    $Size = $Traps{$Trap}[2] if ($Traps{$Trap}[2] > 0);
                }
                else {
                    $Trap =~ s/(\W)/\\$1/g;
                    $Trap = "(?i-xsm)" . $Trap;
                    $Match_code .= "return \"$Trap\" if \$_[0] =~ /$Trap/;";
                }
            }
        }
        exit unless ($Match_code || $Size);
        $Matcher = eval "sub {" . $Match_code . "}";
        if ($ARGV[1] eq "out") {
            @Files = <out/do*>;
        }
        elsif ($ARGV[1] eq "in") {
            @Files = <in/di*>;
        }
        else {
            @Files = <out/do* in/di*>
        }
        matchfile(\@Files);
        $Counter++;
        sleep 3;
    }

    sub matchfile {
        local($/) = \65536;
        FILE:
        while (my $File = shift @{$_[0]}) {
            if ($Size && -s $File >= $Size) {
                ($NF = $File) =~ s/^.*\///;
                rename $File , "capture/" . $NF . "-SIZE";
                print time . " " . $NF . " " . (stat(_))[7] . " SIZE\n";
                next FILE;
            }
            unless (open(FILE, $File)) {
                next FILE;
            }
            while (<FILE>) {
                my $Match = $Matcher->($_);
                if ($Match) {
                    $Match =~ s/\(\?i-xsm\)//;
                    ($NF = $File) =~ s/^.*\///;
                    rename $File , "capture/" . $NF . "-" . $Traps{$Match}[3];
                    print time . " " . $NF . " " . (stat(_))[7] . " " . $Traps{$Match}[3] . "\n";
                }
                next FILE;
            }
        }
    }

    The traplist file that the data is read from looks like:

    Created         Expires         Use     Type    Author  Size    Name    Trap
    07:36:56-07:36  07:36:56-07:36  1       0       XYZ     98765   SIZE    N/A
    07:36:56-07:36  07:36:56-07:36  1       0       XYZ     N/A     TRAP1   cool things to look for

  • The first arg is the name of the directory to look for the traplist file in as well as the base directory to work from based on arg 2.
  • The second arg gives the second piece of information to find the traplist file as well as the directory to work in

    If arg1 = blah, you would look for the traplist file in /var/spool/wt400/gateways/blah
    If arg2 = out, you would open /var/spool/wt400/gateways/blah/traplist.out and you would do your work in /var/spool/wt400/gateways/blah/out

  • Ok, so without further ado - here is my problem:

    I need to have about 20 copies of the exact same script running, where the only difference is the two arguments passed to it, because there is a race condition beyond my control. I am now using far more memory than the shell scripts ever did. I compared:

  • ps -el | grep <shell> - sz = 50
  • ps -el | grep <perl> - sz > 300

    I know where the gap is coming from and I could handle the difference for everything else I gained if it were only one copy, but that difference gets multiplied by every copy running (about 20).

    The only thing that comes to mind is Threads, but I have heard such conflicting information I didn't even consider it when I started the port.

    Do I have to abandon my code or is there a way to take advantage of my multi-proc high end server to have one or maybe two or three handle all the directories???

    Thanks in advance - L~R

    Replies are listed 'Best First'.
    Re: 3 weeks wasted? - will threads help?
    by perrin (Chancellor) on Jan 27, 2003 at 22:11 UTC
      I really don't understand your statement about needing to run 20 copies of the script because of a race condition, but if you want to save memory you should try starting up one script and forking rather than starting 20 different copies of Perl. The copy-on-write nature of memory handling in modern OSes will save you quite a bit of RAM.
        Fair enough....

        In my readmore tags - I explained that each copy works on a different directory - each directory is very transient and that is a race condition beyond my control

        The extra memory overhead is coming from the Perl interpreter - not the code itself (or at least that is my belief) - see below:

        #!/usr/bin/perl -w
        use strict;

        while (1) {
            print "I am only printing and sleeping\n";
            sleep 1;
        }

        The above code shows up in ps -el with almost the same sz as the code in my readmore tags

        Forking will not buy me anything, as I understand it, since I will be making an exact duplicate (memory and all). I was thinking threads may help, but as I understand them, each thread gets its own copy of the interpreter - no memory savings either.

        So my question stated more clearly is:

        Given a piece of code to parse a single directory, how can I parse multiple directories concurrently (or very nearly) without the memory overhead of each piece requiring its own interpreter?
        Concatenating the files in each directory into one long list isn't feasible either.

        I freely admit that I may be asking to get something for nothing, but it seems like an awful waste not to be able to use the Perl code and continue using the shell script :-(

        Cheers - L~R

          Actually, forking will buy you a lot. You are not understanding what I said about copy-on-write. On a modern OS, when you fork the memory is not actually copied. Only pages that get changed are copied, and the rest is shared between processes. It makes a huge difference and is something that is widely used with mod_perl to reduce the memory overhead of multiple Perl interpreters.
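The fork-from-one-parent idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `watch_directory` is a hypothetical stand-in for the per-directory trap loop, and the two directory names in the demo are placeholders, not the real gateway directories. How much memory is actually shared depends on the OS and on how many pages each child modifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for the per-directory work. The real script
# would run its trap-matching loop here instead of returning at once.
sub watch_directory {
    my ($dir) = @_;
    return -d $dir ? 0 : 1;    # exit status: 0 if the directory exists
}

# Fork one child per directory from a single parent interpreter.
# Returns a hash of pid => directory.
sub spawn_watchers {
    my (@dirs) = @_;
    my %kids;
    for my $dir (@dirs) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: pages it never modifies stay shared with the parent
            # on systems that implement copy-on-write.
            POSIX::_exit(watch_directory($dir));
        }
        $kids{$pid} = $dir;
    }
    return %kids;
}

# Wait for every child and collect exit statuses keyed by directory.
sub reap_watchers {
    my (%kids) = @_;
    my %status;
    while (%kids) {
        my $pid = wait();
        last if $pid == -1;
        next unless exists $kids{$pid};
        $status{ delete $kids{$pid} } = $? >> 8;
    }
    return %status;
}

# Demo with placeholder directories.
my %status = reap_watchers(spawn_watchers('.', '/no/such/dir'));
print "$_ => $status{$_}\n" for sort keys %status;
```

In a real port, each child would chdir into its own gateway directory and loop forever, so the parent would mostly sit in `wait()` and restart children that die.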
          Given a piece of code to parse a single directory, how can I parse multiple directories concurrently (or very nearly) without the memory overhead of each piece requiring its own interpreter? Concatenating the files in each directory into one long list isn't feasible either.

          I'd probably switch to an event-based model - maybe use POE. Threads are probably not what you want in this instance. At the moment Perl's threads are fairly heavy and slow - more like wet string :-)

    Re: 3 weeks wasted? - will threads help?
    by BrowserUk (Patriarch) on Jan 27, 2003 at 23:18 UTC

      Looking at your code, you appear to be processing one, the other, or both of the directories 'in/*' & 'out/*' relative to the path "/var/spool/wt400/gateways/" . $ARGV[0].

      Presumably each of your 20 copies of the script is processing a different subtree of /var/spool/wt400/gateways/?

      In which case, you could do your initial chdir to the common parent directory and do your globbing as <*/in/*> etc., and process all the files from the 20 subdirs in one loop. I notice that you have a sleep 3 in your main loop, which probably means that you're not utilizing much of the CPU as it stands, so you should easily have enough processor to cope with the 20 dirs in the main loop. You might need to change that sleep to

      sleep 3-$time_spent_last_pass.

      I realise that the traplist file is different for each subtree, but the <*/in/*> form of the glob returns the filenames in the form subdir/in/file, so you can then split on the /'s, extract the subdir, and use this as the first key in your %Traps hash to select the appropriate set of trap information for the file.

      It means re-working your code somewhat, but it is probably less work than moving to either use Threads; or fork.
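A rough sketch of the single-loop approach described above, assuming paths of the form subdir/in/file. The helper names `group_by_subdir` and `remaining_sleep` and the sample paths `gw1/...` and `gw2/...` are hypothetical, introduced only for illustration; in the real script the list would come from a glob such as `<*/in/di* */out/do*>`.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group paths of the form "subdir/in/file" by their leading subdir,
# so each file can be checked against that subdir's own trap list.
sub group_by_subdir {
    my (@paths) = @_;
    my %by_subdir;
    for my $path (@paths) {
        my ($subdir, $queue, $file) = split m{/}, $path, 3;
        next unless defined $file;    # skip anything not subdir/queue/file
        push @{ $by_subdir{$subdir} }, $path;
    }
    return %by_subdir;
}

# Seconds left to sleep after a pass that took $elapsed seconds,
# i.e. the "sleep 3 - time spent on the last pass" idea, never negative.
sub remaining_sleep {
    my ($interval, $elapsed) = @_;
    my $left = $interval - $elapsed;
    return $left > 0 ? $left : 0;
}

# Demo with hypothetical sample paths instead of a live glob.
my @files  = ('gw1/in/di001', 'gw1/out/do001', 'gw2/in/di002');
my %groups = group_by_subdir(@files);
for my $subdir (sort keys %groups) {
    printf "%s: %d file(s)\n", $subdir, scalar @{ $groups{$subdir} };
}
```

With this layout, a two-level hash such as `$Traps{$subdir}{$trap}` would hold one trap list per gateway, and the main loop would end with `sleep remaining_sleep(3, time - $start)`.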

      Examine what is said, not who speaks.

      The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        It's just not possible BrowserUk

        The transient files are created with a naming syntax such that the glob automatically lists them oldest to newest. It can take up to two seconds to process a single directory, but once I am done, I have a fair amount of assurance (I will change the sleep statement if it is too much) that I can wait 3 seconds before I parse the directory again. I can't wait over half a minute, which is what could happen if I parse all directories at one time.

        I really need to parse each directory as if it were the only one.

        Cheers - L~R

          Why not glob, then fork and process the globbed data in the child while sleeping 2 seconds in the parent, and loop all over again? Also, to answer your question about forking above, check your system's man page for fork(2). I am pretty sure HP-UX has used copy-on-write (it only copies the page stack and changed memory locations) since 10.x.
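The glob-then-fork loop suggested above might look something like this minimal sketch. `process_files` is a hypothetical placeholder for the trap matching, and the demo runs only two short passes over the current directory rather than looping forever with a 2-second sleep.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(:sys_wait_h);

# Hypothetical placeholder for the real trap matching.
sub process_files {
    my (@files) = @_;
    return scalar @files;
}

# Fork a child to work through one snapshot of the file list.
# The parent gets the child's pid back immediately.
sub run_pass {
    my (@files) = @_;
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        process_files(@files);
        POSIX::_exit(0);    # leave the child without running END blocks
    }
    return $pid;
}

# Collect any children that have already finished, without blocking.
sub reap_finished {
    my $count = 0;
    while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
        $count++;
    }
    return $count;
}

# Demo: two passes over the current directory with a 1-second pause.
for my $pass (1 .. 2) {
    my @files = glob('*');
    run_pass(@files);
    sleep 1;
    reap_finished();
}
```

One caveat with this pattern: if a child is still busy when the next pass starts, two children may see the same file, so the rename into capture/ would need to tolerate a file that has already been moved.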

    (jptxs) Re: 3 weeks wasted? - will threads help?
    by jptxs (Curate) on Jan 28, 2003 at 01:12 UTC
      Without knowing the whole situation it's a bit hard to follow all of this, but here's my take. After reading your previous post, it strikes me that maybe you can make a controlling program in Perl which gathers and stores the data that needs storing and then drives the shell scripts based on their output. Again, not knowing the whole scope of the issues, it's hard to say if that would work, but in case you hadn't explored that possibility, it may help. I figure if it worked as a shell script but needed more, maybe Perl can just provide the "more" and leave alone what already works within the performance parameters you need.

      We speak the way we breathe. --Fugazi

    Re: 3 weeks wasted? - will threads help?
    by busunsl (Vicar) on Jan 28, 2003 at 13:45 UTC
      If you don't want to try threads, continuations or coroutines may be of help. They make multitasking without forking or threading possible.

      Have a look at Marc Lehmann's Coro module.

    Re: 3 weeks wasted? - will threads help?
    by castaway (Parson) on Jan 28, 2003 at 13:00 UTC
      Well, I like 5.005 threads :)
      Just to compare: my script that uses threads shows 3 versions of itself in 'ps -el', all with a size of about 2400. According to 'top', however, each thread is using about twice that amount of memory (4600), with 1600 shared. So one of them must be wrong, but which?

      C. *waves the threading flag* ;)

    Node Type: perlquestion [id://230355]
    Approved by Ovid