ph0enix has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
In one project of mine I need to share a huge data structure (a hash of arrays of objects). My application uses a client/server model. I expected to use threads for serving clients, with the data on the server shared via threads::shared. My problem is that threads::shared can't share complex data structures and objects. How can this be solved? All threads need to be able to modify this shared data...

I see one solution, but it would rapidly decrease performance: save the shared data to some kind of database - BerkeleyDB, PostgreSQL, etc. BerkeleyDB could be a good solution, but are the existing packages designed for the tie interface prepared for concurrent access? Is there sample Perl code?

Comments, suggestions?


Replies are listed 'Best First'.
Re: How to share huge data structure between threads?
by diotalevi (Canon) on Jan 10, 2003 at 14:43 UTC

    BerkeleyDB (http://www.sleepycat.com) is well suited to such an application. Unfortunately the Perl module is really light on documentation. I'll provide a very quick example here, but you'll really want to read the documentation on the database web site. Since the Perl module is based on the C API, you'll want to read the C API documentation and, wherever it shows C code, pretend it's Perl code.

    The one caveat is that I never use the tie interface. All it does is call the object-oriented interface anyway, so I save a method call and just use the database as it's designed to be used. I have an example of object-oriented (though not using the CDB features) BerkeleyDB up at http://www.greentechnologist.org/tiger/unpack.pl and http://www.greentechnologist.org/tiger/graph.pl. The CDB features just "happen" if you enable them.

    use strict;
    use warnings;
    use BerkeleyDB;

    my $env = get_environment();
    my $db  = BerkeleyDB::Btree->new
        ( -Filename => 'my_file.db',
          -Flags    => DB_CREATE,
          -Env      => $env )
        or die "Couldn't open database at my_file.db: $BerkeleyDB::Error";

    # The database now supports concurrent access. You'd just open it
    # in each thread and use it. See
    # http://www.sleepycat.com/docs/ref/cam/intro.html
    # for info on the concurrent system.
    # You can also do nested transactions and logging. See
    # http://www.sleepycat.com/docs/ref/transapp/intro.html
    # and continue from there, or just read the docs from the
    # table of contents.

    sub get_environment {
        BerkeleyDB::Env->new
            ( -Flags => DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB )
            or die "Couldn't initialize BerkeleyDB environment: $BerkeleyDB::Error";
    }

    Update: I should add that the SleepyCat documentation explicitly notes that BerkeleyDB's concurrent access modes work correctly across threads. I posted a code example for multi-process access - your multi-threaded version should read similarly, though there's no real reason you should need threading given your specified requirements.

    Update: I didn't know that the Perl module BerkeleyDB wasn't thread-safe. The underlying library is. So if you're going to follow my suggestion, you probably want multiple processes.
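    Editor's note: here is a minimal sketch of the multi-process pattern described above, using the same environment flags as the example. It assumes the BerkeleyDB module is installed; the file name and key are made up for illustration.

    ```perl
    use strict;
    use warnings;
    use BerkeleyDB;

    # Each process opens its own environment and database handle;
    # DB_INIT_CDB provides locking for one writer / many readers.
    sub open_db {
        my $env = BerkeleyDB::Env->new(
            -Home  => '.',
            -Flags => DB_CREATE | DB_INIT_MPOOL | DB_INIT_CDB,
        ) or die "env: $BerkeleyDB::Error";
        return BerkeleyDB::Btree->new(
            -Filename => 'my_file.db',
            -Flags    => DB_CREATE,
            -Env      => $env,
        ) || die "db: $BerkeleyDB::Error";
    }

    my $pid = fork();
    die "fork: $!" unless defined $pid;

    if ($pid == 0) {            # child process: write a record
        my $db = open_db();
        $db->db_put('written_by', $$);
        exit 0;
    }

    waitpid($pid, 0);           # parent: read after the child is done
    my $db = open_db();
    my $val;
    $db->db_get('written_by', $val);
    print "child $val wrote the record\n";
    ```

    Each process opens its own handles against the same environment; nothing is shared in-process, so the Perl module's lack of thread safety does not matter.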


    Fun Fun Fun in the Fluffy Chair

Re: How to share huge data structure between threads?
by djantzen (Priest) on Jan 10, 2003 at 15:04 UTC

    Implicit sharing of nested structures is prohibited because it creates the potential for accidental sharing of private data. Since the ithreads model is predicated upon complete separation of all data by default, allowing implicit sharing of references within shared parent structures would open the door to accidental corruption of data. From perlthrtut:

    use threads;
    use threads::shared;
    my $var = 1;
    my $svar : shared = 2;
    my %hash : shared;

    ... create some threads ...

    $hash{a} = 1;       # all threads see exists($hash{a}) and $hash{a} == 1
    $hash{a} = $var;    # okay - copy-by-value: same effect as previous
    $hash{a} = $svar;   # okay - copy-by-value: same effect as previous
    $hash{a} = \$svar;  # okay - a reference to a shared variable
    $hash{a} = \$var;   # This will die
    delete $hash{a};    # okay - all threads will see !exists($hash{a})

    So the solution using threads is to take references to the things you wish to share at each level of a parent structure and to share them on a case-by-case basis. In other words, you must explicitly share not only the parent reference, but every reference contained therein.

    Here's some example code of a basic object with shared members:

    use strict;
    use warnings;

    package Foo;

    sub new {
        my ($class, $arg) = @_;
        my $this = bless {}, $class;
        $this->{args} = undef;
        return $this;
    }

    sub set {
        my ($this, $arg) = @_;
        $this->{args}[0] = $arg;  # setting an entry in a shared array reference
    }

    1;

    # End of the module, and now a test script
    use strict;
    use warnings;
    use Foo;
    use threads;
    use threads::shared;

    my $foo           = Foo->new();
    my $nested_array  = [];
    my $nested_string = 'bar';

    share($foo);
    share($nested_array);
    share($nested_string);

    $foo->{args} = $nested_array;  # set the shared array reference

    # pass in a reference to the shared scalar
    my $thr1 = threads->create(sub { $foo->set(\$nested_string) });

    # Update: if in Foo::set we manually set the argument passed, say, to
    # 'quux', the object will contain that string rather than 'bar' --
    # proof that we do indeed have a shared nested reference.

    $thr1->join();
    print $foo->{args}[0];

    It's a bother to do this, but it's better than accidental trampling of data. Hope this helps.
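    Editor's note: newer versions of threads::shared (1.15 and later, bundled with Perl 5.10.1+) provide shared_clone, which performs exactly this recursive sharing in one call by deep-copying a nested structure into shared storage. A minimal sketch:

    ```perl
    use strict;
    use warnings;
    use threads;
    use threads::shared;   # shared_clone requires threads::shared 1.15+

    # Deep-copy a hash of arrays into shared storage in one call,
    # instead of calling share() on every nested reference by hand.
    my $data = shared_clone({ list => [ 'a', 'b' ] });

    my $thr = threads->create(sub {
        push @{ $data->{list} }, 'c';   # modification visible to all threads
    });
    $thr->join();

    print scalar @{ $data->{list} }, "\n";   # 3
    ```

    Note that shared_clone copies the structure, so subsequent changes must go through the returned reference.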

Re: How to share huge data structure between threads?
by PodMaster (Abbot) on Jan 10, 2003 at 14:53 UTC
    Yes and no.
    DB_File.pm is not thread-safe.
    Neither is BerkeleyDB.pm.

    The strategy with DB_File is to call (tied %hash)->sync after writing, and to retie before reading to ensure you get the latest data.
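    Editor's note: a minimal sketch of the sync-then-retie strategy described above. The file name and key are made up; DB_File requires the Berkeley DB library to be available.

    ```perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # Writer: update the hash, then push the change out to the file.
    tie my %hash, 'DB_File', 'shared.db', O_CREAT | O_RDWR, 0644
        or die "tie (write): $!";
    $hash{key} = 'value';
    (tied %hash)->sync;          # flush the write to disk
    untie %hash;

    # Reader: retie to pick up the latest data from the file.
    tie my %read, 'DB_File', 'shared.db', O_RDWR, 0644
        or die "tie (read): $!";
    print $read{key}, "\n";      # value
    untie %read;
    ```

    In a long-running reader you would repeat the retie before each batch of reads, since the tied hash only reflects the state of the file at tie time.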

    This will work fine, but only if you use a newer version of BerkeleyDB (anything above 2.5 will work fine with this technique).

    If you want better transaction control, use BerkeleyDB.pm, and you get access to the full API (just go buck wild).

    Your other choice to consider is DBD::SQLite.
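    Editor's note: a hedged sketch of the DBD::SQLite option. The table and column names are made up; it assumes DBI and DBD::SQLite are installed. SQLite serializes writers itself, so each thread or process simply opens its own handle.

    ```perl
    use strict;
    use warnings;
    use DBI;

    # Each thread/process opens its own connection to the same file.
    my $dbh = DBI->connect('dbi:SQLite:dbname=shared.db', '', '',
                           { RaiseError => 1, AutoCommit => 1 });

    $dbh->do('CREATE TABLE IF NOT EXISTS store (k TEXT PRIMARY KEY, v TEXT)');

    # Write a key/value pair...
    $dbh->do('INSERT OR REPLACE INTO store (k, v) VALUES (?, ?)',
             undef, 'colour', 'blue');

    # ...and any other handle sees it immediately.
    my ($v) = $dbh->selectrow_array(
        'SELECT v FROM store WHERE k = ?', undef, 'colour');
    print "$v\n";   # blue
    ```

    For a hash-of-arrays-of-objects you would serialize each object (e.g. with Storable) into the value column, which costs speed but gives you real transactions.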

    If any of this is too slow for you, you can always use Cache::Cache.


    MJD says you can't just make shit up and expect the computer to know what you mean, retardo!
    ** The Third rule of perl club is a statement of fact: pod is sexy.

Re: How to share huge data structure between threads?
by dragonchild (Archbishop) on Jan 10, 2003 at 15:12 UTC
    Here's a few stupid questions:
    1. Why are you using threads instead of processes? Apache's children are processes, and it's extremely robust. Apache doesn't necessarily have to serve HTML, either. It's a CGI server which can serve anything you want. And Perl can be tightly integrated into it.
    2. Why not set up the shared data structure as a SOAP process and have your children communicate with it? That way, you can even have your objects on another server and still be OK.
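    Editor's note: a minimal sketch of point 2, using SOAP::Lite. The DataStore class, port, and key names are all made up for illustration; the server start is guarded behind an environment variable so the snippet compiles even without SOAP::Lite installed.

    ```perl
    use strict;
    use warnings;

    # Hypothetical class owning the big structure; one process holds it.
    package DataStore;
    my %store;
    sub get { my ($class, $key) = @_; return $store{$key} }
    sub set { my ($class, $key, $val) = @_; $store{$key} = $val; return 1 }

    package main;

    # The owning process exposes DataStore over SOAP; clients in other
    # processes (or on other machines) call get/set remotely.
    if ($ENV{RUN_SOAP_SERVER}) {
        require SOAP::Transport::HTTP;
        SOAP::Transport::HTTP::Daemon
            ->new(LocalPort => 8080)
            ->dispatch_to('DataStore')
            ->handle;                      # blocks, serving requests
    }

    # A client would look roughly like:
    #   use SOAP::Lite;
    #   my $soap = SOAP::Lite->uri('http://localhost:8080/DataStore')
    #                        ->proxy('http://localhost:8080/');
    #   $soap->set('key', 'value');
    #   print $soap->get('key')->result;
    ```

    The trade-off is a network round trip per access, but all mutation is funneled through one process, so no locking is needed in the clients.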

    ------
    We are the carpenters and bricklayers of the Information Age.

    Don't go borrowing trouble. For programmers, this means Worry only about what you need to implement.

Re: How to share huge data structure between threads?
by broquaint (Abbot) on Jan 10, 2003 at 15:30 UTC
    My problem is that threads::shared can't share complex data structures and objects. How can this be solved?
    I don't advocate this solution nor am I proud of it, but ...
    use threads;
    use threads::shared;
    use Devel::Pointer;

    {
        package foo;
        sub new  { bless [rand 100] }
        sub blah { print "in blah()\n" }
    }

    my $obj = foo->new();
    $obj->blah();
    print "$obj: @$obj\n";

    my $o : shared = address_of($obj);

    $t = threads->new(sub {
        print "\tin thread\n\t";
        my $obj2 = deref($o);
        $obj2->blah();
        print "\t$obj2: @$obj2\n";
    });
    $t->join;

    __output__
    in blah()
    foo=ARRAY(0x804beec): 43.5769482822256
            in thread
            in blah()
            foo=ARRAY(0x804beec): 43.5769482822256
    Now just look into the little memory-wiping stick ... *flash*.
    HTH

    _________
    broquaint

      Looks interesting, but if $obj goes out of scope, the created object will be destroyed (its reference count drops to zero, so it can be garbage-collected) and can't be restored from the stored address. This occurs e.g. if the variable is created in one thread and should be used in another.

Re: How to share huge data structure between threads?
by LogicalChaos (Beadle) on Jan 10, 2003 at 18:09 UTC
    Well, you don't say how big huge is, so...

    Have you looked into IPC::Shareable? You can tie your hash to a shared memory region and then just make sure you lock it appropriately before reading/writing. I use this in multi-process (not threaded) programs and it works quite well for me.
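    Editor's note: a minimal sketch of the tie-and-lock pattern described above. The glue key is made up (IPC::Shareable uses only the first four characters), and the explicit size option matters - see the warning about the small default segment size further down this thread.

    ```perl
    use strict;
    use warnings;
    use IPC::Shareable;

    # Attach a hash to a SysV shared-memory segment. 'data' is a
    # made-up glue key; size is in bytes.
    tie my %shared, 'IPC::Shareable', 'data', {
        create => 1,
        size   => 1024 * 1024,     # 1 MB; the default segment is much smaller
    } or die "tie failed: $!";

    # Lock around every read-modify-write:
    (tied %shared)->shlock;
    $shared{counter} = ($shared{counter} || 0) + 1;
    (tied %shared)->shunlock;

    my $count = $shared{counter};
    print "counter is $count\n";

    (tied %shared)->remove;        # clean up the segment when finished
    ```

    Every cooperating process ties with the same glue key; the shlock/shunlock pair is what makes concurrent read-modify-write safe.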

    If the standard shared memory segment size is too small (32MB?), you can increase it at runtime by adjusting /proc/sys/kernel/shmmax, or by rebuilding the kernel.

    Cheers,
    Rob

       Well, you don't say how big huge is, so...

      The server I'm testing on has 3GB of RAM, which will probably be insufficient for the final application. The current size of the test data I want to share is about 600MB (after loading into Perl). I don't think IPC::Shareable can meet these requirements.

        One thought: if you're reluctant to move to a file-based data-sharing option (like Tie::DBI or another) because of speed, you might consider creating a ramdisk and placing the tied file on it.


        Examine what is said, not who speaks.

        The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

        OUCH... Seems like a real DB is the only way to go. What sort of performance do you need? Do you have large quantities of keys, or of data associated with the keys?

        Have you seen Tie::DBI? I've not used it, but it looks interesting and might be a quick fit for your application.

        Please post your eventual solution back to this thread, as I'm curious what you come up with.

        Luck,
        Chaos

      As a side note, I just got burned by IPC::Shareable. Not badly - I just failed to RTFM, and discovered the rather hard way the 64K default size of shared memory "segments" or "partitions" or whatever. Read the manual, and look at the size option when tying.
        Oops, sorry - I forgot to mention that one. I remember hitting the same problem when I first used IPC::Shareable... and I think it took me a day or two to finally read the manual.

        But now you've got lots of memory to use, up to the kernel set limit ;-)

        Cheers,
        Logical