http://www.perlmonks.org?node_id=1044936


in reply to threads::shared seems to kill performance

Yes, shared aggregates are considerably slower than non-shared.

But try it this way and it'll be about 2/3rds less slow:

    use threads;
    use threads::shared;

    my %hashOf1000SharedHashes = map{ $_ => &share( {} ) } 1 .. 1000;

    my %data :shared;

    foreach my $x ( 1 .. 5000 ) {
        $data{$x} = shared_clone( \%hashOf1000SharedHashes );
    }

    undef %hashOf1000SharedHashes;

That said, building a 2D HoH of empty hashes (with consecutive numerical indices?) doesn't seem very useful.

Presumably that structure will need to be populated at some point -- and with that amount of data, it must be coming in from outside the program -- and once you add the IO required to fetch that data into the mix, the cost of making the data shared will pale into insignificance.

If, instead of building a huge, empty shared data structure and then populating it (which will take considerable further time), you shared and populated it in one pass, you'd save considerable time and the sharing costs would almost disappear amongst the IO costs.
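A minimal sketch of that one-pass approach -- the record format, field names and values here are invented for illustration; the real input would come from the OP's actual IO source:

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my @input = (                  # stand-in for the real IO source
        "alice,2013-07-18,42",
        "bob,2013-07-19,7",
    );

    my %data :shared;

    # Share each inner structure on demand, in the same pass that
    # reads the input -- no pre-built empty shared structure needed.
    for my $line ( @input ) {
        my( $owner, $date, $value ) = split /,/, $line;
        $data{ $owner } //= shared_clone( {} );
        push @{ $data{ $owner }{ $date } //= shared_clone( [] ) }, $value;
    }

    print "$data{alice}{'2013-07-18'}[0]\n";    # prints 42

Only the data that actually arrives gets shared, so no time is spent sharing empty placeholders.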

Tell us more about what goes into this monster, where it comes from, and how it is used, and we'll probably be able to help you save a lot of time.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: threads::shared seems to kill performance
by Jacobs (Novice) on Jul 18, 2013 at 05:07 UTC

    Hello BrowserUK, threading master of masters from what I hear! Thank you for your response.

    I'm aware I'm probably breaking several laws and killing small kittens in the process by allocating a hash this big.

    Originally the data comes from a SQLite database. There's one huge table that's keyed via 2 levels of parameters - say: owner, date, some_data (with <owner,date> being unique and the set of owners relatively small) - and by loading this into those hashes, I'm trying to introduce some structure to the data so that I can later access it from my program in a way I can easily understand and work with ($data{user}{date}[]).

    Strangely, loading the data from the database doesn't have as big an impact on performance as the sharing does. In my real-life tests - where I do in fact initialize the hash and populate it in one pass, as you suggest - loading from the DB and populating the hash (with a significantly reduced set of data) took about 2s. Once I added the sharing (in a way similar to my example above), it took about 26s.

      Originally the data comes from a SQLite database....

      Then I very strongly advise against taking the data out of the db and putting it into a hash.

      Not only will doing so take considerable time and substantial space; but although for read-only use you won't need user-level locking, there is no way to turn off the locking Perl uses to protect its internals, and that will bring your application to a crawl.

      Instead, share the db handle and create statement handles for your queries. Whilst I haven't done this personally (yet), according to this, the default 'serialized' mode of operation means that you don't even need to do user locking as the DB will take care of that for you.

      If you create/clone your DB as an in-memory DB, after you've spawned your threads; then you will avoid the duplication of that DB and the performance should be on a par with, and potentially faster than a shared hash.

      When I get time, which may not be soon, I intend to test this scenario for myself as I think it might be a good solution to sharing large amounts of data between threads. Something Perl definitely needs.

      It may even be possible to wrap it over in a tied hash to simplify the programmers view of the DB without incurring the high overheads of threads::shared (That's very speculative!).

      In any case, as your data is already in a DB, don't take it out and put it in shared hashes. That just doesn't make sense. Just load it into memory after your threads are spawned; and then set the dbh into a shared variable where the threads can get access to it.

      At least, that is what I would (and will) try.



        BrowserUk:

        Since I can only ++ a post once, you'll have to settle for a few more virtuals: ++ ++ ++

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

        I actually originally started by only loading data from the DB in smaller chunks - the way I'm using that DB allows me to split the whole table into about 5000 smaller ones. So instead of doing one SELECT and creating one huge hash (which always felt a bit wrong to me), I did 5000 smaller SELECTs and worked through those in sequence.

        Performance of this, however, was terrible - roughly 20 seconds for loading that 1 big SELECT from the DB vs 220 seconds for the 5000 smaller SELECTs - all via one DB handle. And this was on an SSD.

        As for the in-memory DB, this is the first time I've heard about it and admittedly it looks very promising. Thanks for the hint - let me give that a try...

        @BrowserUK, what exactly do you mean by this?

        "Instead, share the db handle and create statement handles for your queries."

        I'm using DBI and that doesn't seem to be very thread-friendly. Also, the SQLite DBD doesn't mention threading anywhere in its documentation.

        When I try to pass the db handle to my thread as a parameter, I get this:

            $g_dbh = DBI->connect( "dbi:SQLite:dbname=:memory:" );
            threads->create( \&my_thread, $g_dbh );

        Thread 1 terminated abnormally: DBD::SQLite::db prepare failed: handle 2 is owned by thread 7f7f64003200 not current thread 7f7f6455fc00 (handles can't be shared between threads and your driver may need a CLONE method added)

        If I try to share the db handle, I get:

            our $g_dbh :shared;
            $g_dbh = DBI->connect( "dbi:SQLite:dbname=:memory:" );

        Invalid value for shared scalar
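        Both errors are DBI telling you the same thing: a handle created in one thread can't be used (or shared) from another. One workaround that does work -- sketched here against an invented file name and schema, and note it uses a file-backed DB, since a plain :memory: DB is private to the connection that created it -- is to have each thread open its own connection after it has been spawned:

            use strict;
            use warnings;
            use threads;
            use DBI;

            my $dbfile = "demo_$$.db";      # hypothetical file name
            unlink $dbfile;

            {   # set up a tiny table from the main thread, then disconnect
                my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", undef, undef,
                                        { RaiseError => 1 } );
                $dbh->do( 'CREATE TABLE t ( owner TEXT, date TEXT, some_data INT )' );
                $dbh->do( q{INSERT INTO t VALUES ( 'alice', '2013-07-18', 42 )} );
                $dbh->disconnect;
            }

            sub worker {
                # Each thread opens its OWN handle; nothing crosses threads.
                my $dbh = DBI->connect( "dbi:SQLite:dbname=$dbfile", undef, undef,
                                        { RaiseError => 1 } );
                my( $n ) = $dbh->selectrow_array(
                    q{SELECT some_data FROM t WHERE owner = ? AND date = ?},
                    undef, 'alice', '2013-07-18',
                );
                return $n;
            }

            my @got = map { $_->join } map { threads->create( \&worker ) } 1 .. 4;
            print "@got\n";     # each thread reads the same row: 42 42 42 42
            unlink $dbfile;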