PerlMonks
Multithreading leading to Out of Memory error

by joemaniaci (Sexton)
on Jun 07, 2013 at 15:58 UTC ( #1037694=perlquestion )
joemaniaci has asked for the wisdom of the Perl Monks concerning the following question:

So I have a new multithreading implementation that should be smooth running, and for the most part it is. It wasn't until I tried evaluating 231 GB of files that I started getting "Out of Memory!" errors and having the program die. I have been over everything three times now and still cannot figure it out. So here is what I have...

use threads('yield', 'stack_size' => 64*4096, 'exit' => 'threads_only', 'stringify');
use Thread::Queue;
# use various others (File::..., DBI, ...)

my @FoundFiles = ...; # subroutine to get all of the applicable files (and their directories)
my $Threads = 8;
my $workq = Thread::Queue->new();

$workq->enqueue(@FoundFiles);
$workq->enqueue(undef) for (1..$Threads);

threads->create('executeall') for (1..$Threads);

sub executeall {
    while (my $i = $workq->dequeue()) {
        last if $i eq undef;
        if ($i =~ /filetypea/) { parseitthisway($i); }
        if ($i =~ /filetypeb/) { parseitanotherway($i); }
        ....
    }
    threads->detach();
}

Now in my googling I have come across several references talking about perl threads maybe holding on to excess data over time. As you can tell, the only real global piece of data I am using is the queue. Therefore, all the real data is being built up in the individual parse subroutines, meaning the perl garbage cleanup should be taking care of that once the subroutine returns. Unless it's bugged or something. The only thing I can think of is destroying and recreating my threads every 200-300 file iterations.
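As an aside, here is a minimal, self-contained sketch of the same queue pattern (file names and counts are made up for the demo) with the undef sentinel tested explicitly via defined(). Note that `$i eq undef` warns and actually compares against the empty string, and the `while (my $i = ...)` condition would also, wrongly, stop on a file named "0":

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Threads = 4;
my $workq   = Thread::Queue->new();
my $doneq   = Thread::Queue->new();   # collects results, just for this demo

my @workers = map { threads->create(\&executeall) } 1 .. $Threads;

$workq->enqueue("file$_.dat") for 1 .. 10;
$workq->enqueue(undef) for 1 .. $Threads;  # one sentinel per worker

$_->join() for @workers;   # join rather than detach, so we can wait

sub executeall {
    # dequeue() blocks until an item arrives; defined() makes the
    # undef sentinel the explicit exit condition.
    while (defined(my $i = $workq->dequeue())) {
        # dispatch on file type here, e.g. /filetypea/ vs /filetypeb/
        $doneq->enqueue("done:$i");
    }
}
```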

FINAL SOLUTION

EDIT: I don't know if it is a bug in Perl or what, but essentially the purpose of the array was to store certain data, and then get quantities of repeated elements. I changed the array to a hash and changed...

push(@array, $data);

into

$newHotness{$data}++;

Then I got the quantities I wanted later on down the road...

while (my ($k, $v) = each(%newHotness)) {
    # ...check that I have the expected $v per $k
}

Unless perl has some maximum array limit, and unless there was some sort of overflow issue, I have no idea why the original implementation had such a bad memory leak. Either way the memory leak is gone.
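For illustration, here is a small self-contained sketch (with made-up data) of the array-vs-hash difference: the hash stores one entry per *distinct* value plus a count, rather than one scalar per occurrence, which is far less memory for heavily repeated data:

```perl
use strict;
use warnings;

# Hypothetical data standing in for the parsed values.
my @data = qw(1.5 2.5 1.5 3.5 1.5 2.5);

# Old approach: one array element per occurrence.
my @array;
push @array, $_ for @data;

# New approach: one hash entry per distinct value, holding a count.
my %newHotness;
$newHotness{$_}++ for @data;

while (my ($k, $v) = each %newHotness) {
    print "$k seen $v times\n";
}
```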

So I went back out of curiosity and looked at that array, essentially commented out the new code and uncommented the old implementation and looked at the sizes of the arrays. At one point the array held over 500,000 items, so I don't know if that is a lot or a little. Either way, some googling has led me to the fact that perl doesn't care so long as the system has the memory to store it. Now my machine has 16 gigs and never even came close to being fully utilized. I am assuming the operating system placed some limit on the perl.exe itself. Either way it doesn't matter because it wouldn't be an issue if the array was properly reclaimed. I did it by the book...

@array = ();
undef @array;

...which had zero effect. So my theory stands that there may be an overflow somehow and so when the functions return and garbage collection is performed, it is missing the memory that was overflowed. FYI, I have perl v5.12.3 compiled for multi-thread. I am going to try to test my theory and I guess depending on the results, it may get a new question.

Final edit

So after doing some research into perl memory management I see what's going on. So...

@array = ();
undef @array;

...doesn't do what I thought it does. It simply clears out all the C pointers and structures, but does not return the memory. Perl by default holds onto that memory; the perl developers made speed a greater priority than memory utilization. So it keeps the memory for future use, in the hope that reusing it will be faster than requesting memory from the system again. You can compile perl to use your operating system's malloc() implementation, but then you lose some of your ability to move your application across systems.
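A small sketch of what this means in practice. The sub name is hypothetical, and the reuse happens inside the interpreter's allocator, so it can't be observed directly from Perl code, only inferred from the process's memory footprint:

```perl
use strict;
use warnings;

# Build a large array inside a sub, then clear it. The freed SVs go
# back to Perl's internal free pool, not to the OS; a later call can
# largely reuse that pooled memory instead of growing the process.
sub build_and_clear {
    my @array = (1) x 100_000;
    my $n = @array;
    @array = ();      # frees the elements into Perl's pool
    undef @array;     # also drops the array's pointer table
    return $n;
}

my $first  = build_and_clear();   # allocates from the OS
my $second = build_and_clear();   # can reuse the pooled memory
print "$first $second\n";
```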

I think with a single thread I was fine, because each consecutive file reused the memory allocated the first time around. Once I multithreaded it, issues arose because I might have multiple threads parsing that file type at once, at which point I needed memory allocated for 2, 3, 4 or more files, or more precisely, copies of the array in question.

Re: Multithreading leading to Out of Memory error
by BrowserUk (Pope) on Jun 07, 2013 at 16:35 UTC

    There is simply not enough information here to judge whether you've encountered one of the (increasingly rare) thread memory leaks, or if your ParseXXX() routines are just badly coded. Any conclusions or recommendations based on the scant information you've posted will be premature, and likely wrong.

    If you want real answers and solutions; post the real code.

    How many files does that 231GB dataset contain?


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      It's about 3000 files. I can't post all of it because it's on a classified network. All I can say is that it is mostly ASCII file types, one .xls file, as well as one hex-formatted file that I read through and do checks on. For each file I create a running log file with any errors found, and at the end I insert a record into a database basically saying whether or not the file was bad. I guess if anything, I'll have to look to see if any of the modules I am using could be a problem as well. Which are...

      use threads(...);
      use Thread::Queue;
      use File::Find;
      use File::Basename;
      use DBI;
      use DBD::ODBC;
      use Spreadsheet::ParseExcel;
      use Switch;

      There is definitely a gradual increase in memory usage over time.

      I don't use any references, so there shouldn't be any circular references that the perl garbage collector fails to pick up.

      Are there any tools I can use? Such as checking memory usage before and after a method call to ensure that the amount of used memory before and after the dealloc for that particular method is the same?

      What happens in perl when two threads call the same method? I assume each will just get its own copy, which will be fine since, again, there are no shared resources.
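      That assumption can be checked directly: ithreads clone every variable that isn't explicitly threads::shared at create() time, so each thread works on its own copy. A minimal demonstration:

```perl
use strict;
use warnings;
use threads;

# $counter is NOT shared, so each thread receives its own clone of it
# (still 0 at create() time) and increments only that clone.
my $counter = 0;

my @t = map {
    threads->create(sub { ++$counter; return $counter; })
} 1 .. 3;

my @results = map { $_->join() } @t;

print "threads saw: @results; parent still has: $counter\n";
# each thread saw only its own copy (1), and the parent's copy is untouched (0)
```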

        I can't post all of it because it's on a classified network.

        It should be perfectly possible (and legitimate) to produce a cut-down, but runnable version of your program that shows the generation of the file list, queue handling, and thread procedure(s) etc. without any of the proprietary logic being evident or discoverable. Ie. Discard the logging; change search constants to innocuous values; rename variables if they hint at the purpose or method of the code. etc.

        Ask your mechanic to help you diagnose the problems with your car whilst you've left it at home in the garage, and see what reaction you get.

        I'll have to look to see if it's any of the modules I am using that could be a problem

        Switch is problematic and deprecated. (Nothing to do with threading.)

        Spreadsheet::ParseExcel is known to leak badly even in single-threaded code.

        DBI is (I believe) fine for multi-threaded use these days; but historically, many of the DBD::* modules (or their underlying C libraries) were not thread-safe.

        Personally, I still advocate using DBI only from a single thread within multi-threaded apps. Set up a second queue and have your processing threads send their SQL to a single thread dedicated to dealing with the DB.
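        A sketch of that layout. The actual DBI calls are left as comments (the connection details aren't in this thread), and the file names and counts are placeholders; workers never touch DBI, they just enqueue records for the one DB thread:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Threads = 4;
my $workq = Thread::Queue->new();   # feeds the parsing workers
my $dbq   = Thread::Queue->new();   # carries results to the DB thread

# The only thread that would ever hold a DBI handle.
my $db_thread = threads->create(sub {
    # my $dbh = DBI->connect($dsn, $user, $pass);  # one persistent connection
    my $rows = 0;
    while (defined(my $rec = $dbq->dequeue())) {
        # $dbh->do('INSERT INTO results VALUES (?, ?)', undef, @$rec);
        $rows++;
    }
    return $rows;
});

my @workers = map {
    threads->create(sub {
        while (defined(my $file = $workq->dequeue())) {
            # ... parse $file ...
            $dbq->enqueue([$file, 'ok']);   # hand the result off
        }
    });
} 1 .. $Threads;

$workq->enqueue("file$_.dat") for 1 .. 8;
$workq->enqueue(undef) for 1 .. $Threads;
$_->join() for @workers;

$dbq->enqueue(undef);               # all workers done: stop the DB thread
my $inserted = $db_thread->join();
print "$inserted rows inserted\n";
```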

        My reaction to your further description is that I would split your all-purpose thread procedure into several independent, specialist thread procedures, each fed by a different queue, and I would perform the filetype selection before queuing the names.

        That would allow (for example) the .xls procedure to be the only one that requires Spreadsheet::ParseExcel, rather than loading that into every thread.

        Ditto, by segregating out DBI, DBD::ODBC and the associated DBD::* modules and loading them into a separate, standalone thread fed by a queue, you reduce the size of all the other threads and ensure that you only need a single, persistent connection to the DB; and so remove another raft of possible conflicts in the process.

        By making each thread dedicated to a particular type of file processing -- and only loading that stuff required for that particular processing into that thread -- you avoid duplicating everything into every thread -- thus saving some memory. You can also then easily track down which type of thread is leaking and, if necessary, arrange for that (type of) thread to be re-started periodically.

        I'd also avoid the practice of first generating a big array of files and then dumping that array into the queue en masse. At the very least, arrange for the main thread to feed the queue from the array at a rate that prevents the queue growing beyond what is necessary to keep the threads fed with work; 2 or 3 times the number of threads is usually a good starting point.
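        A sketch of that throttling using Thread::Queue's pending() method (the file names are placeholders; Thread::Queue 3.01+ also offers $q->limit, which makes enqueue() block for you):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $Threads = 4;
my $workq   = Thread::Queue->new();

my @workers = map {
    threads->create(sub {
        my $n = 0;
        while (defined(my $item = $workq->dequeue())) { $n++ }
        return $n;   # files this worker handled
    });
} 1 .. $Threads;

# Feed the queue gradually: only top it up when it drops below
# ~3x the thread count, so it never balloons to the whole file list.
my @FoundFiles = map { "file$_.dat" } 1 .. 100;
for my $file (@FoundFiles) {
    threads->yield() while $workq->pending() > 3 * $Threads;
    $workq->enqueue($file);
}
$workq->enqueue(undef) for 1 .. $Threads;

my $total = 0;
$total += $_->join() for @workers;
print "$total files processed\n";
```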

        But why not start the threads immediately and then feed the queue as the files are discovered, thus overlapping discovery with processing?

        And if you go for my dedicated filetype threads with their own queues, then you can also do the selection and allocation at the same time.

        But without seeing the code, this is all little more than educated guessing.

Re: Multithreading leading to Out of Memory error
by sundialsvc4 (Abbot) on Jun 08, 2013 at 14:31 UTC

    It is also relevant to consider whether these could be processes instead of threads. In the former case, an entirely separate memory-management context is created; in the latter, Perl's memory manager and its particular flavor of quasi-threads implementation are in effect throughout. Threads all run in the same memory-management context (with suitable complications), which is not torn down. (My knowledge of the perlguts of the thread implementation is minimal; others here are experts and gurus.)

    It would be useful to know if the same behavior occurs when there is only one thread, and/or when the processing is done sequentially in the main thread. Does it, or does it not, foul up after processing a certain number of files? Does altering the number of threads alter the point at which it hoses up? You should also note exactly which Perl version you are using.

    Yes, “committing hara-kiri” is a legitimate way to forestall memory-leak problems, especially in unknown processes. (The technique is useless for threads, as described.) FastCGI and mod_perl programs are sometimes deliberately arranged to process some n number of requests before they voluntarily terminate, at which point the parent process wakes up, reaps the child, and launches another copy until the pool of workers is restored. (Some separate provision would need to be made for the parent to be aware of end-of-job.)

      (The technique is useless for threads, as described.)

      Can you demonstrate that?

        BrowserUK, didn’t you “forget” to log in?

        There is nothing “personal,” nor technically un-informed, about my specific comments here, most specifically including the comment about processes vs. threads. Threads in every programming system share a single process-level context; hence, the same memory-management system. During the course of execution, a “leaky” procedure can, in time, accumulate an excess amount of unrecoverable memory. In a context of threads, that memory is never cleaned up, whereas by definition the entire context of a process is.

        If the “hara-kiri” approach didn’t work as a way of dealing in a black-box fashion with leaky faucets, then it would not be the case that Apache, nginx, and PSGI (Plack) all have specific means by which to do just that. My comments are technically valid at face value, as they were intended to be. If you have disagreement, then (a) show yourself, and (b) comment about the technical statements, not the Monk making them.

      So after a few days of intermittent network connectivity (the data is on a networked drive) and testing, I figured it out. I think.

      push(@array, $data);

      I went down the path of using only one thread for processing, and of parsing only a single file type. Once I started, the behavior went away, and of course it wasn't until the final file type that the behavior came back. I looked through my code to see what it had that no other file type had, and it was...

       @array = sort { $a <=> $b } @array;

      So I took out that code and tested again; the problem was still present. So I commented things out piecemeal until I narrowed it down to the single line of code above. With that single line of code, 3 additional MB of memory is used up as the thread leaves the parsing method for that particular file type.

      So here is the basic rundown of this file.

      sub parsefiletypeX {
          my $filename = shift;
          # get the directory from the filename (w/ directory)
          open(IN,  $filename) or die ...;
          open(OUT, $outfile)  or die ...;
          my $lineCount  = 1;
          my $nextline   = <IN>;
          my $headerlines;
          my $samplesize = 1;
          my @array;
          $nextline = trim_whitespace($nextline); # my subroutine
          ++$lineCount; # Did this before I learned about $.

          # Read the header for the file (five lines)
          # Do a bunch of regex checks on the header lines

          # Read in the first record, which contains its own line of
          # sub-header data as well as the two lines of actual data, in
          # Pascal float format I believe. Or Fortran, actually.

          # Regex on the first line and push a certain piece of data.
          # This line is NOT the bad line.
          push(@array, $fi[4]); # 1st line has 5 values

          # Regex checks on the next two lines

          # Then read in the rest of the 3-lined records
          until (eof(IN)) {
              # get the same three lines and do the regex checks
              # now the faulty call is made
              push(@array, $fi[4]);
          }

          # Do stuff to @array, like sorting and determining certain checks.
          close IN;
          close OUT;
          # call function that uploads a record to the database
      }


      I even tried clearing out that array after I was done using it, such as...

      @array = ();
      undef @array;

      But this has no effect!?!? So what is going on?

        So what is going on?

        There is simply not enough information here to even begin to guess.

        If you want this debugged, you are going to have to find some way around your 'can't post the real code' problem and supply -- publicly or privately -- real, runnable source code + sample data that demonstrates the problem. If not, you're on your own I'm afraid.

Re: Multithreading leading to Out of Memory error
by coyocanid (Acolyte) on Jun 10, 2013 at 03:17 UTC

    Rather than use threads, I've been using a drop-in replacement: forks. I am so far happy with it.

    http://search.cpan.org/~rybskej/forks-0.34/lib/forks.pm

    From the POD : The standard Perl 5.8.0 threads implementation is very memory consuming, which makes it basically impossible to use in a production environment, particularly with mod_perl and Apache. Because of the use of the standard Unix fork() capabilities, most operating systems will be able to use the Copy-On-Write (COW) memory sharing capabilities (whereas with the standard Perl 5.8.0 threads implementation, this is thwarted by the Perl interpreter cloning process that is used to create threads). The memory savings have been confirmed.

      Well, considering it's the pushing of a float onto an array that's causing problems, I am not even sure this is a thread issue any longer. I am at a loss, really. Going to see if it's possible to implement that functionality with a hash.
