 
PerlMonks  

Re: Multithreading leading to Out of Memory error

by BrowserUk (Pope)
on Jun 07, 2013 at 16:35 UTC ( #1037704 )


in reply to Multithreading leading to Out of Memory error

There is simply not enough information here to judge whether you've encountered one of the (increasingly rare) thread memory leaks, or whether your ParseXXX() routines are just badly coded. Any conclusions or recommendations based on the scant information you've posted would be premature, and likely wrong.

If you want real answers and solutions, post the real code.

How many files does that 231GB dataset contain?


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^2: Multithreading leading to Out of Memory error
by joemaniaci (Sexton) on Jun 07, 2013 at 17:18 UTC

    It's about 3000 files. I can't post all of it because it's on a classified network. All I can say is that it is mostly ascii file types, one .xls file, as well as one hex-formatted file that I read through and do checks on. For each file I create a running log file with any errors in the file, and at the end I insert a record into a database basically saying whether or not the file was bad. I guess, if anything, I'll have to look to see if it's any of the modules I am using that could be the problem as well. Which are...

    use threads(...); use Thread::Queue; use File::Find; use File::Basename; use DBI; use DBD::ODBC; use Spreadsheet::ParseExcel; use Switch;

    There is definitely a gradual increase in memory usage over time.

    I don't use any references, so there shouldn't be any circular references that the perl garbage collector fails to pick up.

    Are there any tools I can use? Such as checking memory usage before and after a method call to ensure that the amount of used memory before and after the dealloc for that particular method is the same?
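For the before/after check described here, one crude approach (a hypothetical, Linux-only sketch; `suspect_routine` is a stand-in, not from the original program) is to read the process RSS from /proc around the call:

```perl
use strict;
use warnings;

# Read this process's resident set size (kB) from /proc (Linux only).
sub rss_kb {
    open my $fh, '<', '/proc/self/status' or return;
    while( <$fh> ) {
        return $1 if /^VmRSS:\s+(\d+)\s+kB/;
    }
    return;
}

sub suspect_routine {                      # stand-in for a ParseXXX() call
    my @waste = ( 'x' x 1024 ) x 10_000;   # allocate ~10MB of strings
    return scalar @waste;
}

my $before = rss_kb();
die "no /proc/self/status on this platform\n" unless defined $before;

suspect_routine();
my $after = rss_kb();

# Perl rarely returns freed memory to the OS, so a steadily climbing
# delta across repeated calls is the leak signature to look for.
printf "RSS before: %d kB, after: %d kB, delta: %d kB\n",
    $before, $after, $after - $before;
```

(For sizing individual data structures rather than the whole process, Devel::Size is the usual tool.)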

    What happens in perl when two threads call the same method? I assume each will just get its own copy, which will be fine since, again, there are no shared resources.

      I can't post all of it because it's on a classified network.

      It should be perfectly possible (and legitimate) to produce a cut-down, but runnable, version of your program that shows the generation of the file list, the queue handling, and the thread procedure(s), etc., without any of the proprietary logic being evident or discoverable. I.e. discard the logging; change search constants to innocuous values; rename variables if they hint at the purpose or method of the code; etc.

      Ask your mechanic to help you diagnose the problems with your car whilst you've left it at home in the garage, and see what reaction you get.

      I'll have to look to see if it's any of the modules I am using that could be a problem

      Switch is problematic and deprecated. (Nothing to do with threading.)

      Spreadsheet::ParseExcel is known to leak badly even in single-threaded code.

      DBI is (I believe) fine for multi-threaded use these days; but historically, many of the DBD::* modules (or their underlying C libraries) were not thread-safe.

      Personally, I still advocate using DBI from only a single thread within multi-threaded apps. Set up a second queue and have your processing threads enqueue their SQL to a single thread dedicated to dealing with the DB.
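A minimal sketch of that second-queue arrangement (the SQL text is illustrative, and the DBI calls are commented out since there is no real database here):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# Worker threads push SQL text onto a dedicated queue; a single thread
# owns the (one) DBI handle and drains that queue.
my $dbQ = Thread::Queue->new;

my $dbThread = threads->create( sub {
    # Only this thread would load DBI/DBD::ODBC and connect:
    #   require DBI;
    #   my $dbh = DBI->connect( $dsn, $user, $pass );
    my $done = 0;
    while( defined( my $sql = $dbQ->dequeue ) ) {
        # $dbh->do( $sql );   # the only place SQL touches the DB
        ++$done;
    }
    return $done;             # how many statements this thread handled
} );

# Processing threads just enqueue text; they never touch DBI themselves.
$dbQ->enqueue( "INSERT INTO results VALUES( 'file1.txt', 'ok' )" );
$dbQ->enqueue( "INSERT INTO results VALUES( 'file2.txt', 'bad' )" );
$dbQ->enqueue( undef );       # undef tells the DB thread to finish

my $handled = $dbThread->join;
print "DB thread handled $handled statements\n";
```

Because only one thread ever holds the handle, there is a single persistent connection and no cross-thread sharing of DBI state at all.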

      My reaction to your further description is that I would split your all-purpose thread procedure into several independent, specialist thread procedures, each fed by a different queue, and I would perform the filetype selection before queuing the names.

      That would allow (for example) the .xls procedure to be the only one that requires Spreadsheet::ParseExcel, rather than loading that into every thread.

      Ditto, by segregating out the DBI, DBD::ODBC and associated DBD::* modules and require-ing them into a separate, standalone thread fed by a queue, you reduce the size of all the other threads and ensure that you only need a single, persistent connection to the DB, removing another raft of possible conflicts in the process.

      By making each thread dedicated to a particular type of file processing -- and only loading that stuff required for that particular processing into that thread -- you avoid duplicating everything into every thread -- thus saving some memory. You can also then easily track down which type of thread is leaking and, if necessary, arrange for that (type of) thread to be re-started periodically.
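A sketch of that segregation, with one queue and one specialist thread per file type (the extensions are illustrative placeholders, and the ParseExcel call is commented out):

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# One queue per file type; each specialist thread require()s only what
# its own file type needs, so e.g. Spreadsheet::ParseExcel would be
# loaded into the .xls thread alone.
my %Q;
$Q{$_} = Thread::Queue->new for qw( extA xls );

my %worker = (
    extA => sub {
        my $n = 0;
        while( defined( my $file = $Q{extA}->dequeue ) ) {
            ++$n;    # plain-ascii checks would go here
        }
        return $n;
    },
    xls => sub {
        # require Spreadsheet::ParseExcel;   # this thread only
        my $n = 0;
        while( defined( my $file = $Q{xls}->dequeue ) ) {
            ++$n;    # spreadsheet parsing would go here
        }
        return $n;
    },
);

my %thr;
$thr{$_} = threads->create( $worker{$_} ) for keys %Q;

# The main thread classifies by extension as it queues the names.
for my $file ( 'a.extA', 'b.xls', 'c.extA' ) {
    my( $type ) = $file =~ /\.(\w+)$/;
    $Q{$type}->enqueue( $file ) if exists $Q{$type};
}
$Q{$_}->enqueue( undef ) for keys %Q;   # one terminator per thread

my %count;
$count{$_} = $thr{$_}->join for keys %Q;
printf "extA: %d files, xls: %d files\n", $count{extA}, $count{xls};
```

Leak-tracking then becomes per-thread-type: if only the .xls worker's memory climbs, you know which parser to blame, and that one thread can be restarted periodically.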

      I'd also avoid the practice of first generating a big array of files and then dumping that array into the queue en masse. At the very least, arrange for the main thread to feed the queue from the array at a rate that prevents the queue from growing larger than necessary to keep the threads fed with work; 2 or 3 times the number of threads is usually a good starting point.

      But why not start the threads immediately and then feed the queue as the files are discovered, thus overlapping their discovery with the processing?
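A sketch of that overlap (the scratch directory, extension, and throttle factor are all illustrative), feeding the queue directly from the find() callback and holding it to a few items per thread:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use File::Find;
use File::Temp qw( tempdir );

# Start the workers first, then let File::Find feed the queue as files
# are discovered; throttle so the queue holds at most ~3 items per thread.
my $NTHREADS = 2;
my $Q        = Thread::Queue->new;

my @threads = map {
    threads->create( sub {
        my $n = 0;
        while( defined( my $file = $Q->dequeue ) ) {
            ++$n;    # real parsing would happen here
        }
        return $n;
    } );
} 1 .. $NTHREADS;

# A scratch directory with a few empty .extA files, for demonstration only.
my $dir = tempdir( CLEANUP => 1 );
for my $i ( 1 .. 5 ) {
    open my $fh, '>', "$dir/file$i.extA" or die $!;
    close $fh;
}

find( sub {
    return unless /\.extA$/;
    sleep 1 while $Q->pending > 3 * $NTHREADS;   # simple throttle
    $Q->enqueue( $File::Find::name );
}, $dir );

$Q->enqueue( ( undef ) x $NTHREADS );            # end-of-work markers

my @counts = map { $_->join } @threads;
my $total  = 0;
$total    += $_ for @counts;
print "processed $total files\n";
```

The throttle keeps peak memory flat: the queue never holds thousands of pathnames, only a handful beyond what the workers are currently chewing on.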

      And if you go for my dedicated filetype threads with their own queues, then you can also do the selection and allocation at the same time.

      But without seeing the code, this is all little more than educated guessing.



        Well it's nice to see I wasn't going completely down the wrong path. I already removed Switch, and ParseExcel is actually the last thing done when all the threads have already been destroyed. I usually only come across 2-3 .xls files so that part is single threaded.

        When it comes to DBI and DBD::ODBC I tried...

        require DBI; require DBD::ODBC;

        inside the methods that need it, instead of...

        use DBI; use DBD::ODBC;

        ...at the very top but the behavior was very erratic.

        Ironically enough, I discovered this bug while getting ready to work on feeding the queue as soon as I found files with the correct extensions, instead of queuing them all up front; but I wanted to resolve this before attempting that.

        I also thought about creating a single subroutine to handle each file type, but the issue is that one file type is always 1kb, so its thread would be done in seconds and then sit idle, while other files are gargantuan. The goal was that the small/medium files (which are the majority) could be handled while the bigger files were being processed over time.

        This is how I grab all the files I need, and it is only done once at the very beginning, before the threads are created:

        find sub {
            $File = $File::Find::name, -d && '/';
            if( $File =~ /\.extA$/ || $File =~ /\.extB$/ ..... ) {
                $File =~ s/some formatting stuff/;
                push @FoundFiles, $File;
            }
        }, $Directory;

        Outside of that it's pretty much all the file checks: making sure a line has the right number of items, bounds checks, and so on and so forth. Nothing complicated, mostly simple regex stuff.
