Beefy Boxes and Bandwidth Generously Provided by pair Networks
Clear questions and runnable code
get the best and fastest answer
 
PerlMonks  

Re^3: threads::shared seems to kill performance

by BrowserUk (Pope)
on Jul 18, 2013 at 06:04 UTC ( #1044974=note: print w/ replies, xml ) Need Help??


in reply to Re^2: threads::shared seems to kill performance
in thread threads::shared seems to kill performance

Originally the data comes from a SQLite database....

Then I very strongly advise against taking the data out of the db and putting it into a hash.

Not only will doing so take considerable time and substantial space, although for read-only use you won't need locking, there is no way to turn off the locking Perl uses to protect its internals, and that will bring your application to a crawl.

Instead, share the db handle and create statement handles for your queries. Whilst I haven't done this personally (yet), according to this, the default 'serialized' mode of operation means that you don't even need to do user locking as the DB will take care of that for you.

If you create/clone your DB as an in-memory DB, after you've spawned your threads; then you will avoid the duplication of that DB and the performance should be on a par with, and potentially faster than a shared hash.

When I get time, which may not be soon, I intend to test this scenario for myself as I think it might be a good solution to sharing large amounts of data between threads. Something Perl definitely needs.

It may even be possible to wrap it over in a tied hash to simplify the programmers view of the DB without incurring the high overheads of threads::shared (That's very speculative!).

In any case, as your data is already in a DB; don't take it out and put it in shared hashes. That just doesn't make sense. Just load it into memory after you threads are spawned; and then set the dbh into a shared variable where the threads can get access to it.

At least, that is what I would (and will) try.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.


Comment on Re^3: threads::shared seems to kill performance
Re^4: threads::shared seems to kill performance
by roboticus (Canon) on Jul 18, 2013 at 11:09 UTC

    BrowserUk:

    Since I can only ++ a post once, you'll have to settle for a few more virtuals: ++ ++ ++

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^4: threads::shared seems to kill performance
by Jacobs (Novice) on Jul 18, 2013 at 20:50 UTC

    I actually originally started by only loading data from the DB in smaller chunks - the way I'm using that DB allows me to split the whole table into about 5000 smaller ones. So instead of doing one SELECT and creating one huge hash (which always felt a bit wrong to me), I did 5000 smaller SELECTs and worked through those in sequence.

    Performance of this however was terrible - roughly speaking about 20 seconds for loading that 1 big SELECT from DB vs 220 seconds for 5000 smaller SELECTs - all via one DB handle. And this was on SSD.

    As for the in-memory DB, this is the first time I've heard about it and admittedly it looks very promising. Thanks for the hint - let me give that a try...

Re^4: threads::shared seems to kill performance
by Jacobs (Novice) on Jul 19, 2013 at 02:05 UTC

    @BrowserUK, what exactly do you mean by this?

    <quote>Instead, share the db handle and create statement handles for your queries. </quote>

    I'm using DBI and that doesn't seem to be very thread friendly. Also SQLite DBD doesn't mention threading anywhere in the documentation.

    When I try to pass the db handle to my thread as a parameter, I get this:

    $g_dbh = DBI->connect("dbi:SQLite:dbname=:memory:"); threads->create(\&my_thread, $g_dbh); Thread 1 terminated abnormally: DBD::SQLite::db prepare failed: handle + 2 is owned by thread 7f7f64003200 not current thread 7f7f6455fc00 (h +andles can't be shared between threads and your driver may need a CLO +NE method added)

    If I try to share the db handle, I get:

    our $g_dbh :shared; $g_dbh = DBI->connect("dbi:SQLite:dbname=:memory:"); Invalid value for shared scalar

      Sorry. It looks like we'll have to try and locate some alternative to (DBI) using sqlite from Perl.

      DBI is broken beyond understanding and the documentation is so crap that working out how to fix it is beyond my abilities.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

      Okay. I found an obscure detail(Section entitled "In-memory Databases And Shared Cache" ) in the sqlite documentation that allows me to work around DBI's brokeness.

      This demonstrates it:

      #! perl -slw use strict; use threads; use threads::shared; use threads::Q; use Time::HiRes qw[ time ]; use DBI; use constant { CONNECT=> 'dbi:SQLite:dbname=file:memdb?mode=memory&cache=shared', CREATE => 'create table if not exists DB ( ID integer(8),' . join(',', map "F$_ text(15)", 1..9) . ')', INSERT => 'insert into DB ( ID , ' . join( ',', map "F$_", 1..9 ) . ') values (' . '?,' x 9 . '?) +', INDEX => 'create index if not exists I1 on DB ( ID )', QUERY => 'select * from DB where ID = ?', }; sub thread { my $tid = threads->tid; my( $Q ) = @_; $Q->dq; ## Wait for DB my $dbh = DBI->connect( CONNECT, '', '' ) or die DBI::errstr; my $sth = $dbh->prepare( QUERY ) or die DBI->errstr; while( my $id = $Q->dq ) { $sth->execute( $id ) or die DBI::errstr; my $r = $sth->fetch or warn( "No data for $id" ) and next; ## do something with record. printf "[$tid] %5u %s %s %s %s %s %s %s %s %s\n", @{ $r }; } $sth->finish; $dbh->disconnect; } my @chars = ( 'a'..'z' ); sub dummy { my $n = shift; join '', @chars[ map int( rand @chars ), 1 .. $n ]; } our $T //= 4; our $N //= 100; my $dbh = DBI->connect( CONNECT, '', '', { AutoCommit =>0 } ) or die D +BI::errstr; $dbh->do( 'PRAGMA synchronous = off' ); $dbh->do( 'PRAGMA cache_size = 800000' ); $dbh->do( CREATE ) or die DBI::errstr; my $ins = $dbh->prepare( INSERT ) or die DBI->errstr; for my $n ( 1 .. $N ) { my @fields = ( $n, map dummy( 15 ), 1 .. 9 ); $ins->execute( @fields )or die $ins->errstr; $n %100 or $dbh->commit } $ins->finish; $dbh->commit; #$dbh->do( INDEX ) or die DBI::errstr; my $sth = $dbh->prepare( QUERY ) or die DBI->errstr; for my $id ( 1 .. 5 ) { $sth->execute( $id ) or die DBI::errstr; my $r = $sth->fetch() or warn( "No data for $id" ) and next; ## do something with record. printf "[main] %5u %s %s %s %s %s %s %s %s %s\n", @{ $r }; } $sth->finish; print $dbh->selectrow_array( 'SELECT count(*) from DB' ); my $Q = threads::Q->new( 10 ); my @threads = map threads->create( \&thread, $Q, shared_clone( $dbh ) +), 1 .. $T; $Q->nq( ('Wakey wakey') x $T ); $Q->nq( $_ ) for 1 .. 20; ## $N; $Q->nq( (undef) x $T ); $_->join for @threads; $dbh->disconnect; unlink 'file'; ## without this, the memory db will persist in a appare +ntly empty file???? ## That might be turned into an advantage????

      There is however some weirdness associated with this. (See the comments on the last two lines above).

      Basically, the trick to making multiple DBI handles (in different threads) refer to the same in memory database is using a dbname of the form:

      file:memdb?mode=memory&cache=shared

      The weirdness is that although this is a "memory db", a file (in this example called 'file' in the local directory) is created. It will always be 0 bytes and running wc on it confirms it has no content.

      But, run the above program a second time and it will add to the number of records in the table. IT WILL DOUBLE IN SIZE!

      But if you delete that empty file, the records from the first run disappear!

      Not sure what to do about that other than just delete the file.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Wow, this is quite something. I'm trying to absorb all that and I have probably a very stupid question at the beginning - what is threads::Q? I can't find that on CPAN or anywhere else.

        So I finally got some time to play with this and I still can't wrap my head around it.

        If I understand it correctly, the trick is in the fact that you:

        1. load the whole DB as a memdb with shared cache
        2. then use the
          shared_clone( $dbh )
          to get this memory where the DB is loaded shared with all the threads
        3. and then inside the thread create a new DB handler, but thanks to the fact you're using the same memdb with shared cache, it reuses the same memory as the DB handler in the main thread

        Is that even remotely correct? :-) And if so:

        1. Why do the shared_clone( $dbh )? Isn't the shared memdb enough for the threads to have access to the same memory? I've been trying to somehow see the difference between the two scenarios (with and without passing the shared_clone( $dbh )) and I'm probably not measuring this right, but the memory usage (vm_stat) seems to be similar in both cases.
        2. What's the purpose of the
          $Q->nq( ('Wakey wakey') x $T ); ... $Q->dq; ## Wait for DB
          ? Why pass the string to the thread and then just dequeue it? And what's the relation to waiting for DB?

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://1044974]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others scrutinizing the Monastery: (7)
As of 2014-08-21 07:13 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (128 votes), past polls