PerlMonks  

Re^3: Multithreading leading to Out of Memory error

by BrowserUk (Pope)
on Jun 07, 2013 at 19:58 UTC


in reply to Re^2: Multithreading leading to Out of Memory error
in thread Multithreading leading to Out of Memory error

I can't post all of it because it's on a classified network.

It should be perfectly possible (and legitimate) to produce a cut-down but runnable version of your program that shows the generation of the file list, queue handling, thread procedure(s), etc., without any of the proprietary logic being evident or discoverable. I.e., discard the logging; change search constants to innocuous values; rename variables if they hint at the purpose or method of the code; etc.

Ask your mechanic to help you diagnose the problems with your car whilst you've left it at home in the garage, and see what reaction you get.

I'll have to look to see if it's any of the modules I am using that could be a problem

Switch is problematic and deprecated. (Nothing to do with threading.)

Spreadsheet::ParseExcel is known to leak badly even in single-threaded code.

DBI is (I believe) fine for multi-threaded use these days; but historically, many of the DBD::* modules (or their underlying C libraries) were not thread-safe.

Personally, I still advocate only using DBI from a single thread within multi-threaded apps. Set up a second queue and have your processing threads queue their SQL to a single thread dedicated to dealing with the DB.
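
A minimal sketch of that arrangement, assuming a threads-enabled perl. The table name, the SQL text, and the worker/item numbers are made up for illustration, and the actual DBI calls are left as comments so the sketch runs without a database:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# The only thread that would ever touch the database.
# A real version would 'require DBI' and connect once in here.
sub db_thread {
    my ($Qsql) = @_;
    my $executed = 0;
    while (defined(my $job = $Qsql->dequeue)) {
        my ($sql, @bind) = @$job;
        # $dbh->do($sql, undef, @bind);   # assumed DBI usage, not run here
        ++$executed;
    }
    return $executed;
}

my $Qsql     = Thread::Queue->new;
my $dbThread = threads->create({ context => 'scalar' }, \&db_thread, $Qsql);

# Processing threads never load DBI; they just queue statement + bind values.
my @workers;
for my $id (1 .. 4) {
    push @workers, threads->create(sub {
        for my $item (1 .. 5) {
            # Hypothetical table and columns.
            $Qsql->enqueue(
                ['INSERT INTO results (worker, item) VALUES (?, ?)', $id, $item]
            );
        }
    });
}

$_->join for @workers;
$Qsql->enqueue(undef);              # tell the DB thread there is no more work
my $executed = $dbThread->join;
print "DB thread handled $executed statements\n";
```

Queuing an array ref per statement keeps the bind values separate from the SQL, so the DB thread could still use placeholders.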

My reaction to your further description is that I would split your all-purpose thread procedure into several independent, specialist thread procedures, each fed by a different queue, and I would perform the filetype selection before queuing the names.

That would allow (for example) the .xls procedure to be the only one that requires Spreadsheet::ParseExcel, rather than loading that into every thread.

Ditto, by segregating out the DBI, DBI::ODBC and associated DBD::* modules and requiring them into a separate, standalone thread fed by a queue, you reduce the size of all the other threads and ensure that you only need a single, persistent connection to the DB; and so remove another raft of possible conflicts in the process.

By making each thread dedicated to a particular type of file processing -- and only loading that stuff required for that particular processing into that thread -- you avoid duplicating everything into every thread -- thus saving some memory. You can also then easily track down which type of thread is leaking and, if necessary, arrange for that (type of) thread to be re-started periodically.
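
Something along these lines, as a rough sketch only. The file names, the two types, and the handler bodies are placeholders; the point is that each type gets its own queue and thread, and any heavy module would be loaded with require inside the one thread that needs it:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

# One queue per file type (types chosen purely for illustration).
my %Q = map { $_ => Thread::Queue->new } qw(xls txt);

sub xls_thread {
    my ($q) = @_;
    # A real version would load the heavy module here, in this thread only:
    # require Spreadsheet::ParseExcel;
    my $n = 0;
    while (defined(my $file = $q->dequeue)) {
        ++$n;    # .xls-specific processing would go here
    }
    return $n;
}

sub txt_thread {
    my ($q) = @_;
    my $n = 0;
    while (defined(my $file = $q->dequeue)) {
        ++$n;    # plain-text checks would go here
    }
    return $n;
}

my %thread = (
    xls => threads->create({ context => 'scalar' }, \&xls_thread, $Q{xls}),
    txt => threads->create({ context => 'scalar' }, \&txt_thread, $Q{txt}),
);

# Select by file type *before* queuing, so each thread only sees its own kind.
my @skipped;
for my $file (qw(a.xls b.txt c.txt d.xls e.txt notes.doc)) {
    my ($ext) = $file =~ /\.([^.]+)$/;
    if (defined $ext and exists $Q{ lc $ext }) {
        $Q{ lc $ext }->enqueue($file);
    }
    else {
        push @skipped, $file;
    }
}

$_->enqueue(undef) for values %Q;    # one terminator per queue
my %count;
$count{$_} = $thread{$_}->join for keys %thread;
print "$_: $count{$_}\n" for sort keys %count;
```

With per-type counts (or per-type memory checks) coming back from each thread, it becomes much easier to see which kind of processing is the one that grows.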

I'd also avoid the practice of first generating a big array of files and then dumping that array into a queue en masse. At the very least, arrange for the main thread to feed the queue from the array at a rate that prevents the queue growing to more than is necessary to keep the threads fed with work; 2 or 3 times the number of threads is usually a good starting point.

But why not start the threads immediately and then feed the queue as the files are discovered, thus overlapping their discovery with the processing?
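
For the throttled feeding, a sketch like the following is one way to do it, assuming a threads-enabled perl. The file names and the thread count are invented, and the per-file work is just a counter:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use Time::HiRes qw(sleep);

my $THREADS = 4;
my $Q       = Thread::Queue->new;

# Start the pool first; each worker stops when it dequeues undef.
my @pool;
for (1 .. $THREADS) {
    push @pool, threads->create({ context => 'scalar' }, sub {
        my $done = 0;
        while (defined(my $file = $Q->dequeue)) {
            ++$done;    # real per-file processing would go here
        }
        return $done;
    });
}

# Placeholder list standing in for the real array of found files.
my @FoundFiles = map "file$_.dat", 1 .. 100;

my $maxPending = 0;
for my $file (@FoundFiles) {
    # Hold back while the queue is already about 3x the number of threads.
    sleep 0.01 while $Q->pending > 3 * $THREADS;
    $Q->enqueue($file);
    my $pending = $Q->pending;
    $maxPending = $pending if $pending > $maxPending;
}

$Q->enqueue((undef) x $THREADS);    # one terminator per worker
my $total = 0;
$total += $_->join for @pool;
print "processed $total files; queue never held more than $maxPending\n";
```

The queue stays small however long the file list is, so the memory cost of the backlog is bounded by the throttle rather than by the size of the directory tree.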

And if you go for my dedicated filetype threads with their own queues, then you can also do the selection and allocation at the same time.
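
As a sketch of discovery and allocation happening together: the threads are started first, and the File::Find callback routes each name to the queue for its type as soon as it is found. The .extA/.extB extensions follow the thread's own placeholders; the temporary directory and its files exist only so the example can run anywhere:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;
use File::Find;
use File::Temp qw(tempdir);

# Throwaway directory tree standing in for the real $Directory.
my $Directory = tempdir(CLEANUP => 1);
for my $name (qw(a.extA b.extA c.extB d.other)) {
    open my $fh, '>', "$Directory/$name" or die "$name: $!";
    close $fh;
}

sub worker {
    my ($q) = @_;
    my $n = 0;
    while (defined(my $file = $q->dequeue)) {
        ++$n;    # type-specific processing would go here
    }
    return $n;
}

# One queue and one thread per type, started before discovery begins.
my %Q = map { $_ => Thread::Queue->new } qw(extA extB);
my %thread;
for my $type (keys %Q) {
    $thread{$type} =
        threads->create({ context => 'scalar' }, \&worker, $Q{$type});
}

# Discovery overlaps processing: each match is queued the moment it is seen.
find(sub {
    return unless -f $_ && /\.(extA|extB)$/;
    $Q{$1}->enqueue($File::Find::name);
}, $Directory);

$_->enqueue(undef) for values %Q;    # no more files coming
my %count;
$count{$_} = $thread{$_}->join for keys %thread;
print "$_: $count{$_}\n" for sort keys %count;
```

No @FoundFiles array is ever built here; combined with a throttle on the queue length, nothing in the main thread grows with the number of files.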

But without seeing the code, this is all little more than educated guessing.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Re^4: Multithreading leading to Out of Memory error
by joemaniaci (Sexton) on Jun 07, 2013 at 21:30 UTC

    Well it's nice to see I wasn't going completely down the wrong path. I already removed Switch, and ParseExcel is actually the last thing done when all the threads have already been destroyed. I usually only come across 2-3 .xls files so that part is single threaded.

    When it comes to DBI and DBI::ODBC I tried...

    require DBI; require DBI::ODBC;

    inside the methods that need it, instead of...

    use DBI; use DBI::ODBC;

    ...at the very top but the behavior was very erratic.

    Ironically enough, I discovered this bug while getting ready to work on feeding the queue as soon as I found files with the correct extensions, instead of collecting them all up front, but I wanted to resolve this before attempting that.

    I also thought about creating a single subroutine to handle each file type, but the issue is that one file type is always 1 KB, so its thread would be done in seconds and would then sit idle, while other files are gargantuan. The goal was that small/medium files (which are the majority) could be handled while the bigger files were being processed over time.

    This is how I grab all the files I need. It is only done once at the very beginning, before the threads are created:

    find sub {
        $File = $File::Find::name, -d && '/';
        if ($File =~ /\.extA$/ || $File =~ /\.extB$/ .....) {
            $File =~ s/some formatting stuff/;
            push(@FoundFiles, $File);
        }
    }, $Directory;

    Outside of that, it is pretty much all file checks: making sure each line has the right number of items, bounds checks, and so on and so forth. Nothing complicated, mostly simple regex stuff.

      When it comes to DBI and DBI::ODBC I tried... require inside the methods that need it, instead of... use ...at the very top but the behavior was very erratic.

      That would probably only work if you moved all the DBI handling into a single thread.

      Try to isolate each potentially troubling module and re-run checking for memory growth. There's not really much more I can add.

