Multithreaded Script CPU Usage

by Zenshai (Sexton)
on Aug 25, 2008 at 14:44 UTC ( #706697=perlquestion )
Zenshai has asked for the wisdom of the Perl Monks concerning the following question:

Venerable Monks,

I come to the monastery once again, this time seeking advice about the CPU Use of a multi-threaded script I recently wrote.

First let me explain the script. Its purpose is to walk filesystems, gather information about the files it encounters, write it into temporary files, and import those files into a MySQL database table, essentially creating a giant 'dir' listing. Not very elegant, I know, but it suits our purposes.

Anyway. The actual scanning is done by Worker threads using a very nice freeware utility from the makers of TreeSize, called FileList. The threads are pooled using the Thread::Pool::Simple module. I'll post the code at the bottom of this message.

The script works, but the problem is that the actual Perl process is using far more resources than (I think) it should. Here's what the typical Task Manager pane ends up looking like:

Windows Task Manager
Image Name      User Name   CPU   Mem Usage
Perl.exe        Zenshai     75    125,588K
FileList.exe    Zenshai     06    688K
FileList.exe    Zenshai     04    3,952K
FileList.exe    Zenshai     02    488K
FileList.exe    Zenshai     02    688K
cmd.exe         Zenshai     00    72K
cmd.exe         Zenshai     00    72K
cmd.exe         Zenshai     00    72K
cmd.exe         Zenshai     00    72K
etc...          Zenshai     00    <2000K


So as you can see, it seems that Perl is hogging most of the resources when all it should really be doing is handing out jobs from a queue. I think this is having an impact on the performance of the FileList.exe processes; normally they use about 08-12 CPU units each.

I'm still very new to Perl, so I don't even know where to begin debugging a problem like this. Looking for any help you may be able to provide.

Here is the code creating the pool and adding the jobs:
my $pool = Thread::Pool::Simple->new(
    min      => 2,                # at least 2 workers
    max      => $total_threads,   # at most x workers
    load     => 10,               # add worker if each worker has 10 jobs queued
    do       => [ \&do_handler ], # job handler for each worker
    passid   => 1,                # pass the job id->1st arg to &do_handler
    lifespan => 10000,            # total jobs handled by each worker
);

open( FLP, $deeper_file_path );
foreach my $file_path (<FLP>) {
    $pool->add( $file_path, $filelistEXE_path, $scan_table_name );
}
$pool->join();


Update: I am running Perl v5.8.8, built for MSWin32-x86-multi-thread,
with threads version 1.63.

Let me know if there are any other snippets I can post that would help figure this out.

Thanks for reading, and sorry for the long post.

Re: Multithreaded Script CPU Usage
by renodino (Curate) on Aug 25, 2008 at 15:01 UTC
    1. What version of Perl are you using ?
    2. What version of threads are you using ?
    3. If you're using a fairly recent version of threads, have you tuned the per-thread stacksize to something small/reasonable ?

    Perl Contrarian & SQL fanboy
      Updated OP with the version info.

As for your #3, I think the answer is no, but I can't seem to find where to set that in Thread::Pool::Simple.pm.

      Would changing 'use threads;' to
      use threads ( 'stack_size' => 64*4096, );
in Thread::Pool::Simple.pm be the way to set the stack size? If so, what's a 'small' value?
        Your "use threads" should do the trick, or any of the alternatives described in the "THREAD STACK SIZE" section of the threads POD...just do it as early as possible. I'd suggest that 64 * 4096 is probably a lot more than you need, as Perl has its own notion of a runtime stack, and in reality doesn't use that much process/thread stack (unless you're using very large/complex regexes, and even that issue goes away in 5.10). Note that once stack size is set, its set for all the threads/modules, so you don't need to do anything in Thread::Pool.

        Also keep in mind that each thread has its own Perl interpreter, so memory size may be an unavoidable issue, especially if the threads are spawned from another thread that already has a lot of context (due to the interpreter cloning).

        Update:
        I just realized your issue has 2 parts: memory and CPU; note that my comments thus far have been entirely about the memory issue. However, given the size of the Perl process, might the CPU issue be related to paging/swapping ?


        Perl Contrarian & SQL fanboy
Re: Multithreaded Script CPU Usage
by Illuminatus (Curate) on Aug 25, 2008 at 15:36 UTC
    I think I need a little more info about your program structure. It looks like new threads invoke 'do_handler', but it is not clear how this, in turn, invokes and manages FileList. I would not so much worry about perl's memory usage -- 125M is not all that much. Much more troubling is the 75% CPU. If FileList is supposed to be doing most of the work, and the perl part is simply 'glue' taking output from FileList and inserting it into MySQL, then it should not be doing much. If it is not too large, could you post the do_handler, and what you are doing to put stuff into MySQL?
      Illuminatus, yes that is in fact what I am mostly worried about. In fact I think the Mem Usage here is due to one of the threads doing a regex replace, adding backslashes to prepare a file to be imported into mysql.

      Edit: Just to clarify, I know that the CPU use should also be high when doing a regex on a large file with many matches; however, the CPU usage of the perl.exe process is always high, even when no regex operations are being performed.

      My do_handler function just routes the work to a function in the separate Worker module (I couldn't figure out the syntax to call it directly)
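
      Roughly, it just does something like the following (paraphrased, not the exact code; the argument names are my own):

      use File::FileListerWorker;

      sub do_handler {
          # Thread::Pool::Simple passes the job id first (because of passid => 1),
          # followed by whatever was given to $pool->add().
          my ( $jobid, $file_path, $filelist_exe, $table_name ) = @_;
          return File::FileListerWorker::main_run( $jobid, $file_path, $filelist_exe, $table_name );
      }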

      Here's the relevant code from the Worker module:
      package File::FileListerWorker;

      sub main_run {
          # arguments: 1) job id from Pooler 2) path to scan 3) path to FileList.exe 4) table name
          # returns:   1) return 1;

          #### - START: MAIN EXECUTABLE CODEBLOCK - ####
          my $jobid          = shift;    # read scanpaths from the ARGs
          my $scanpaths_line = shift;    # get path to scan
          my $fl_path        = shift;    # define path to FileList.exe
          my $arg_tbl_name   = shift;    # table to load data into

          my $dbh = &mysql_conn();                                          # connect to mysql
          my $mysql_tablename = &mysql_create_table( $arg_tbl_name, $dbh ); # create table (IF NOT EXISTS)

          my @ta = localtime(time);                                         # get time
          my $time_string =
              $ta[4] . $ta[3] . ( $ta[5] + 1900 ) . "_" . $ta[2] . $ta[1] . $ta[0];   # time string
          my $temp_filename =                                               # make unique filename
              "\\\\directoryToPlaceTempFileIn\\" . $jobid . "_" . $time_string . ".csv";

          &do_scan( $fl_path, $scanpaths_line, $temp_filename );            # call to sub that does the scan
          my $mysql_temp_filename = &fix_file_4mysql($temp_filename);       # add backslashes
          &mysql_load_data( $mysql_temp_filename, $mysql_tablename, $dbh ); # mysql import file
          &cleanup($temp_filename);                                         # delete temp file to preserve space
          &mysql_disc($dbh);                                                # disconnect mysql
          #### - END: MAIN EXECUTABLE CODEBLOCK - ####

          return 1;
      }

      sub do_scan {
          # arguments: 1) location of FileList.exe 2) path to scan 3) output file path
          # returns:   none
          my $fl_path = shift;
          my $pth     = shift;
          chomp($pth);
          my $temp_filename = shift;

          system(qq{$fl_path $pth >$temp_filename});
      }

        Just an idea... what if everything is perfectly right that way? I am not sure how costly fix_file_4mysql() and mysql_load_data() are, but if the do_scan() execution is much faster than loading the DB, perl.exe is still digesting the output of previous runs of FileList.exe. So, you might observe only four FileList.exe jobs currently running while the output of several dozen previous runs is still being processed.

        How big is max => $total_threads? The number of $temp_filename files currently on disk might provide a rough estimate of how many workers are currently busy (a quick glob, sketched below, would do).
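
        A quick way to take that rough measurement, assuming the UNC temp directory and .csv suffix from the worker code above:

        # Each busy worker should have exactly one temp .csv on disk at a time,
        # so counting them gives a rough in-flight job count.
        # (Forward slashes so glob() doesn't treat the backslashes as escapes.)
        my @in_flight = glob('//directoryToPlaceTempFileIn/*.csv');
        print scalar(@in_flight), " temp files currently on disk\n";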

Re: Multithreaded Script CPU Usage
by BrowserUk (Pope) on Aug 26, 2008 at 00:44 UTC

    I suspect that your CPU usage problem is related to this failure when running the Thread::Pool::Simple test suite:

    c:\Perl\packages\Thread-Pool-Simple-0.23>nmake test

    Microsoft (R) Program Maintenance Utility Version 7.00.9466
    Copyright (C) Microsoft Corporation. All rights reserved.

            C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t\1.t t\2.t t\3.t t\4.t
    t\1....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
    # Looks like your test died just after 1.
    t\1....dubious
            Test returned status 255 (wstat 65280, 0xff00)
            after all the subtests completed successfully
    t\2....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
    # Looks like your test died just after 1.
    t\2....dubious
            Test returned status 255 (wstat 65280, 0xff00)
            after all the subtests completed successfully
    t\3....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
    # Looks like your test died just after 1.
    t\3....dubious
            Test returned status 255 (wstat 65280, 0xff00)
            after all the subtests completed successfully
    t\4....ok 1/0Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 39.
    # Looks like your test died just after 1.
    t\4....dubious
            Test returned status 255 (wstat 65280, 0xff00)
            after all the subtests completed successfully
    Failed Test Stat Wstat Total Fail  Failed  List of Failed
    -------------------------------------------------------------------------------
    t\1.t        255 65280     1    0   0.00%  ??
    t\2.t        255 65280     1    0   0.00%  ??
    t\3.t        255 65280     1    0   0.00%  ??
    t\4.t        255 65280     1    0   0.00%  ??
    Failed 4/4 test scripts, 0.00% okay. 0/4 subtests failed, 100.00% okay.
    NMAKE : fatal error U1077: 'C:\Perl\bin\perl.exe' : return code '0x2'
    Stop.

    This failure can be traced to this line from the T::P::S code:

    my $self = &share({});

    Which makes the T::P::S object a shared hash. The problem is that, a few lines later, it creates an instance of Thread::Semaphore and stores the returned handle into that hash:

    $self->{shutdown_lock} = Thread::Semaphore->new();

    Because the hash is shared, the blessedness of the Thread::Semaphore object is lost. I.e., it stops being a blessed scalar and becomes a plain ordinary scalar. And when the code attempts to call a method on that scalar:

    $self->{shutdown_lock}->down(); ## line 39

    it fails (note: the line number below is different because I added some debug code):

    ???SCALAR(0x19a690c)
    Can't call method "down" on unblessed reference at C:\Perl\packages\Thread-Pool-Simple-0.23\blib\lib/Thread/Pool/Simple.pm line 43.

    This is a newly introduced bug that didn't exist when I tried T::P::Simple a few weeks ago. Now, you can "fix" this test failure by upgrading to the latest CPAN versions of threads and threads::shared.
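
    As a quick sanity check, something like the following shows whether a given threads/threads::shared build preserves blessedness across a shared hash. It mirrors the two T::P::S lines quoted above; on a healthy build it prints 'Thread::Semaphore', while on the broken combination the class is reportedly lost:

    use threads;
    use threads::shared;
    use Thread::Semaphore;

    my $self = &share({});                       # same idiom T::P::S uses for its object
    $self->{shutdown_lock} = Thread::Semaphore->new();
    print ref( $self->{shutdown_lock} ), "\n";   # 'Thread::Semaphore' if blessedness survives,
                                                 # a plain 'SCALAR' on the affected versions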

    Unfortunately, with that upgrade, as with so many of the releases ever since they were dual-lived, comes a different set of problems. T::P::Simple now completes all its tests, but leaks memory like a sieve:

    c:\Perl\packages\Thread-Pool-Simple-0.23>nmake test

    Microsoft (R) Program Maintenance Utility Version 7.00.9466
    Copyright (C) Microsoft Corporation. All rights reserved.

            C:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t\1.t t\2.t t\3.t t\4.t
    t\1....ok 1/0Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    t\1....ok
    t\2....ok 901/0Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    t\2....ok
    t\3....ok 901/0Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    Scalars leaked: 1
    t\3....ok
    t\4....ok 2/0Scalars leaked: 1
    t\4....ok
    All tests successful.
    Files=4, Tests=1805, 20 wallclock secs ( 0.00 cusr + 0.00 csys = 0.00 CPU)

    Which wouldn't be so bad if T::P::Simple were the only module affected, but with the latest versions of threads & threads::shared, programs that ran completely clean with 5.8.6 and its distributed version of threads/threads::shared now also leak like a sieve.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I'm not sure what's different in our test environments, but I seem to be having no problems with any of the tests for T::P::S.

      This is what I get:
      D:\Perl\MSWin32-x86-multi-thread-5.8\Thread-Pool-Simple-0.23>nmake test

      Microsoft (R) Program Maintenance Utility Version 1.50
      Copyright (c) Microsoft Corp 1988-94. All rights reserved.

              D:\Perl\bin\perl.exe "-MExtUtils::Command::MM" "-e" "test_harness(0, 'blib\lib', 'blib\arch')" t/*.t
      t/1....ok
      t/2....ok
      t/3....ok
      t/4....ok
      All tests successful.
      Files=4, Tests=1805, 16 wallclock secs ( 0.00 cusr + 0.00 csys = 0.00 CPU)
Re: Multithreaded Script CPU Usage
by Zenshai (Sexton) on Aug 26, 2008 at 13:52 UTC
    Thanks, everyone.

    I think what I am going to do is split this script in two: the threaded script will only launch the scans, and I'll have another script on another box monitor the directory where the temp files are placed, process them, import them, and delete them.

    Hopefully the memory leak BrowserUk describes won't get bad enough to kill my Perl process. Otherwise, expect another long-winded post from me about forking or some such. :)
      Just a few comments:
      1) You should not need 2 scripts. If your initial script did not create threads, but instead started the scans with 'system' and put each process in the background, then the same script could monitor the output files and start new FileList executions as previous ones end (a rough sketch follows after these comments).
      2) There is still the issue of what the 2 mysql functions really do. You might post the 'fix-the-file' function and the 'load data' function code if they are not too large. Either of these could also be taking lots of CPU. Does your mysql_load_data create bulk inserts or individual inserts? Bulk inserts are much faster, both for the client and the server.
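
      To illustrate point 1, a rough sketch of that single-script approach on Win32. The executable path, scan paths, concurrency limit, and loading step are all placeholders; system(1, ...) is the Win32 idiom for starting a command without waiting for it:

      use strict;
      use warnings;
      use POSIX qw(WNOHANG);

      # Placeholders: adjust the executable, paths, and limit to taste.
      my $filelist_exe = 'C:\\tools\\FileList.exe';
      my $max_jobs     = 4;
      my @paths        = ( 'D:\\share1', 'D:\\share2', 'D:\\share3' );

      my %running;    # pid => output file
      my $job_no = 0;

      while ( @paths or %running ) {

          # launch new scans while there is room
          while ( @paths and keys(%running) < $max_jobs ) {
              my $path = shift @paths;
              my $out  = sprintf 'scan_%03d.csv', ++$job_no;
              my $pid  = system( 1, qq{cmd /c $filelist_exe "$path" > $out} );   # async on Win32
              $running{$pid} = $out;
          }

          # reap finished scans and hand their output to the loader
          for my $pid ( keys %running ) {
              if ( waitpid( $pid, WNOHANG ) != 0 ) {
                  my $out = delete $running{$pid};
                  # ...fix_file_4mysql($out) and mysql_load_data(...) would go here...
              }
          }

          sleep 1;
      }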
        The reason I am going to try splitting the script across 2 machines is that I was thinking the same thing as Perlbotics above. There is just too much load on the resources if I try to do everything at the same time. By splitting the global task in two parts I can either run both concurrently on 2 machines or in sequence on 1, depending on my time constraints.

        But just in case, here is the rest of the code you were asking for. Am I doing something horribly wrong there that is causing needless load on the CPU?

        The load data function just executes this SQL:
        my $sql = qq{
            LOAD DATA LOCAL INFILE \"$mysql_tmp_filename\"
            INTO TABLE `$tblname`
            FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\\\\'
            LINES TERMINATED BY '\\r\\n'
            IGNORE 3 LINES
            (file_name, \@file_p, file_size, \@file_la, \@file_lc, \@file_c, file_extension)
            SET file_last_access = STR_TO_DATE(\@file_la, '%c/%e/%Y %l:%i %p'),
                file_path        = \@file_p,
                file_share_name  = substring_index(substring_index(\@file_p,'\\\\',3),'\\\\',-1),
                file_last_change = STR_TO_DATE(\@file_lc, '%c/%e/%Y %l:%i %p'),
                file_creation    = STR_TO_DATE(\@file_c, '%c/%e/%Y %l:%i %p')
        };
        my $sth = $dbh->prepare($sql);
        $sth->execute();
        and here is fix_file_4mysql:
        sub fix_file_4mysql {
            # arguments: 1) path to file
            # returns:   1) the filename fixed for use in a mysql query

            # fix the contents of the temp file for use in mysql
            my $tmp_filename = shift;
            open( IN, "+<$tmp_filename" );
            my @file = <IN>;
            seek IN, 0, 0;
            foreach my $file (@file) {
                $file =~ s/\\/\\\\/g;
                print IN $file;
            }
            close IN;

            # fix temp_filename for use in mysql
            my $mysql_tmp_filename = $tmp_filename;
            $mysql_tmp_filename =~ s/\\/\\\\/g;
            return $mysql_tmp_filename;
        }
        From what I can see, the functions you listed should be fine. My point about the single script went back to the initial problem. Your initial problem was that the main perl script was taking 75% CPU. As was pointed out, this was likely caused by the thread module itself. If you modified your script as I suggested, the perl script should end up using very little CPU, and could therefore run fine together on a single system. Of course, your mileage may vary...
      Well, there goes that theory.

      I remade the script as I specified above, and while it's working exactly as intended, the original problem still remains. Perl CPU usage is still too high when it should be idle.

      I've found that I can give the FileList processes some more CPU room by setting the Perl process's priority to Below Normal or Low, but that's not really solving the problem. It's still needlessly using resources that could be better spent doing other things, or even being idle.
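
      If that workaround sticks, the demotion can at least be automated from inside the script rather than done by hand in Task Manager; a rough sketch using the stock Win32::Process module:

      use strict;
      use warnings;
      use Win32;
      use Win32::Process;

      # Drop the current perl.exe to idle priority so the FileList.exe
      # processes get first call on the CPU.
      my $me;
      Win32::Process::Open( $me, $$, 0 )
          or die Win32::FormatMessage( Win32::GetLastError() );
      $me->SetPriorityClass( IDLE_PRIORITY_CLASS );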

      I'm going to think about some other way to do this, without threads.
Re: Multithreaded Script CPU Usage
by NiJo (Friar) on Aug 26, 2008 at 19:53 UTC
    I see that your approach comes from automating interactive tools. I'm sure that Perl can do this a lot simpler, better, and faster.

    A disk (with one disk head) is basically a single threaded application. Trying to keep 10 threads happy results in a lot of seeks and very slow performance. Sometimes it is OK to have 2 threads: one waiting for a disk seek and the other parsing a different file.

    I'd redesign the application to be pure perl and single threaded. If you need to scan several disks, I'd start one scanner per disk and directly feed into the database using DBI.

    The following "code" is just for showing the perl way to do it with prebuild modules. It is mostly copied from perldoc, won't run out of the box and has no error handling. I could not find a good way to get author information from MS office files. Win32::OLE and some digging into MS Explorer OLE should help.

    use strict;
    use warnings;
    use File::Find;
    use DBI;

    $dsn = "DBI:mysql:database=$database;host=$hostname;port=$port";
    $dbh = DBI->connect( $dsn, $user, $password );
    $sth = $dbh->prepare("INSERT INTO table(foo,bar,baz) VALUES (?,?,?)");

    sub wanted {
        my @stats = stat $_;
        $sth->execute( $foo, $bar, $baz );
    }

    find( \&wanted, @directories_to_search );
    P.S.: Basically you have reinvented the Unix 'locate' tool and ignored Microsoft's indexing files, which allow faster searching.
      I initially started off exactly the way you're saying, but File::Find and stat() were just too slow. FileList is a lot faster (no clue why). I need the entire scan to finish in one weekend, and there can be up to 80 million files to scan. That's also why I went for multithreading, etc. I'm trying to squeeze any juice I can out of the resources I have to get this script to run in its allotted window.

      As for your one scanner per disk suggestion, I don't think starting 300 scanners is the answer here. :)

      I know it's ugly, and part of me weeps somewhere deep inside while I write this stuff, but requirements are making me do it. :(
        It took me some days off and some rethinking to guess at your real bottleneck. Did you forget to tell us about the low-bandwidth remote Windows network shares you are scanning? 300 local disks is too much to believe. If so, collect the data locally into files and transfer them to the server. The parallel scanning should take about a minute for file systems just above desktop size.

        But after solving all the scanning performance issues, you are finally bound by the insert limit of your database. 80 million records is a huge number. 1000 inserts per second would be my estimate for MySQL on a desktop; that is 22 hours in your case. But do your own benchmark for multiple and faster CPUs and disks.

        Back at the drawing board, you need to eliminate the overhead of the database. We don't know anything about the use cases of your database output. If it is just finding a file once per week, I'd cat the local output files together and grep the 800 MB text file at disk speed in less than 10 seconds. In more advanced cases, Perl regexes should have lower overhead than SQL queries, and no insert-limit bottleneck.
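
        A flat-file search like that is only a handful of lines of Perl; a rough sketch, assuming one record per line in the concatenated FileList output:

        use strict;
        use warnings;

        # Usage: perl findfile.pl <pattern> <concatenated_listing.csv>
        my ( $pattern, $listing ) = @ARGV;
        my $re = qr/$pattern/i;

        open my $fh, '<', $listing or die "Cannot open $listing: $!";
        while ( my $line = <$fh> ) {
            print $line if $line =~ $re;   # one sequential pass, limited mainly by disk speed
        }
        close $fh;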
