
Benchmark on deserializing data

by RL (Monk)
on Apr 26, 2007 at 20:33 UTC ( #612265=perlquestion )
RL has asked for the wisdom of the Perl Monks concerning the following question:

Hi to all,

I've done some benchmarking on deserializing data and don't actually understand why deserializing data serialized by Storable is slower than using plain pipe-separated serialization.

Maybe I'm missing the obvious, made a mistake or misinterpreted what I read about Storable.

I've got sets of data which I want to store in a database on disk, using only the core modules that come with Perl. That obviously means DBM files (some 10K datasets to store).

I'm more concerned about retrieval (frequent reads on the DB) than on storing. The code below therefore only represents retrieval, not storage.

I'm trying to serialize a data structure in order to store it using the $DB_HASH format. The benchmark compares deserialization of pipe-separated values against Storable's freeze/thaw functions.

Here is the code:
use strict;
use warnings;
use Benchmark qw( :all );
use Storable qw(freeze thaw);

my (%data, %hash_b, %hash_d);

# data
%data = (
    1 => ['123','456','678'],
    2 => 'value_2',
    3 => 'value_3',
    4 => 'value_4',
    5 => 'value_5',
    6 => 'value_6',
    7 => 'value_7',
    8 => 'value_8'
);

# prepare simulating retrieved data
my $item_1 = join(' ',@{$data{'1'}});
my $pipe_serialized = $item_1.'|'.$data{'2'}.'|'.$data{'3'}.'|'.$data{'4'}
    .'|'.$data{'5'}.'|'.$data{'6'}.'|'.$data{'7'}.'|'.$data{'8'};
my $storable_serialized = freeze(\%data);

cmpthese( -1, {
    # serialized using pipes as a delimiter
    a => sub {
        my @ary = split(/\|/,$pipe_serialized);
        my %hash = ();
        @hash{'1','2','3','4','5','6','7','8'} = @ary;
        $hash{'1'} = [ split(/ /,$hash{'1'}) ];
    },
    b => sub {
        %hash_b = ();
        @hash_b{'1','2','3','4','5','6','7','8'} = split(/\|/,$pipe_serialized);
        $hash_b{'1'} = [ split(/ /,$hash_b{'1'}) ];
    },
    # serialized using storable
    c => sub {
        my $hash_ref = thaw($storable_serialized );
        my %hash = %$hash_ref;
    },
    d => sub {
        %hash_d = %{ thaw($storable_serialized ) };
    },
} );

# check results
use Data::Dumper;
print "hash_b:\n",Dumper(\%hash_b),"\n\n\n";
print "hash_d:\n",Dumper(\%hash_d),"\n\n\n";

And here is the output on my box:
RESULT:
       Rate    d    c    a    b
d   41155/s   -- -25% -49% -59%
c   55138/s  34%   -- -31% -45%
a   80388/s  95%  46%   -- -20%
b  100486/s 144%  82%  25%   --

Ok then, now my questions:

(1) I do understand why _b_ is faster than _a_ - guess it's because of @ary acting as man-in-the-middle

(2) I do not understand why _c_ is faster than _d_. Why does $hash_ref as man-in-the-middle speed things up here?

(3) Most important I don't understand why both methods using Storable are slower than the other two using plain split for deserialization. From what I read I thought Storable was
(a) fast
(b) intended to be used for serializing Perl data structures.
Is there any known point (rule of thumb maybe) where Storable is faster than plain join/split stringification?

Any hint appreciated.
Thx for reading.


Replies are listed 'Best First'.
Re: Benchmark on deserializing data
by adrianh (Chancellor) on Apr 26, 2007 at 21:01 UTC
    Most important I don't understand why both methods using Storable are slower than the other two using plain split for deserialization.

    It's because Storable can cope with pretty much anything. Try doing something like this:

    use Storable qw( freeze thaw );
    {
        no warnings 'once';
        $Storable::Deparse = 1;
        $Storable::Eval    = 1;
    }
    thaw( freeze( sub { print "hello world\n" } ) )->();

    with split - or persisting something with a "|" character in a hash value :-)

    Storable is general - and you'll sometimes pay for that with speed.
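
    A minimal sketch of the "|" problem, assuming a hypothetical two-field record whose first value happens to contain the delimiter. The naive join/split round trip cuts the value apart without any warning, while freeze/thaw returns it unchanged:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical record: one value contains the pipe delimiter.
my %record = ( 2 => 'value|with|pipes', 3 => 'value_3' );

# Naive pipe-separated round trip, in the style of the benchmark code.
my $pipe_serialized = join '|', @record{ 2, 3 };
my %from_split;
@from_split{ 2, 3 } = split /\|/, $pipe_serialized;

# Storable round trip of the same record.
my %from_storable = %{ thaw( freeze( \%record ) ) };

print "split:    $from_split{2} / $from_split{3}\n";       # value / with
print "Storable: $from_storable{2} / $from_storable{3}\n"; # value|with|pipes / value_3
```

    Making the split version safe would mean escaping on write and unescaping on read, and that extra work would have to be included in any fair timing.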

      It's worth pointing out to anyone stumbling upon this thread that the above example will only restore a self contained anonymous sub and not a closure (ie. storable doesn't attempt to restore the lexical state of the sub { }).
Re: Benchmark on deserializing data
by GrandFather (Sage) on Apr 26, 2007 at 21:13 UTC

    You are comparing apples and oranges. You are poking at the reasons for the disjunction between the two techniques when you look at what you need to do to handle an array reference in a hash - you special case the split/join code, but Storable just goes and does it. The cost of "just goes and does it" is a little execution time.

    The surprise is not that Storable is slower, but how little slower it is compared to hand crafted code with up-front knowledge of the data structure to be serialized.

    If you want to spend the time to hand craft code every time the data structure changes you can squeeze a little more speed out of the result for sure, but that becomes a maintenance nightmare and a really good way to lose data if you manage to get the in and out code out of sync with each other. Unless execution time is critical I'd strongly advise you go the Storable route.

    DWIM is Perl's answer to Gödel
Re: Benchmark on deserializing data
by dont_you (Hermit) on Apr 26, 2007 at 21:37 UTC

    1- b() is faster than a() because of the extra @ary. When benchmarking very fast operations, the overhead of creating variables is not negligible. Besides that, in a() you create a new %hash on each iteration of the benchmark, while in b() you are reusing %hash_b (did you forget the 'my'?).

    2- Same as above... you are not doing the same thing. In c() a %hash is created on each iteration, in d() a global %hash_d is used. If a 'my' is added before %hash_d, then d() is faster than c()

    3- Well... Storable is slower than split (from my experience, I can tell you that split is very very fast), so... why use Storable?

    - It is less code and therefore faster to write, and you don't need to test, debug or document it yourself.

    - It is safer. For example, your code fails if the data contains '|'. And if you add decoding of escape sequences, how much faster will split still be?

    - Later you can change the serialized structure without updating the serializer and deserializer.

    - It is not that slow; it is nowhere near half as slow as the split-based code. The following sub is faster than a():

    sub { my $hash_e = thaw($storable_serialized ); } # and then use $hash_e->{'2'}

    It is easier, bah :-) Anyway, you will probably end up spending more time fetching the data from the DB than deserializing it.
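
    The points about variable creation and hash copying can be checked with a benchmark that separates the thaw itself from the copy. This is only a sketch: the sub names (ref_only, copy_lexical, copy_outer) are made up for illustration, and the rates will differ between machines and Perl versions, so no results are quoted here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use Storable qw(freeze thaw);

# Same shape of data as in the original benchmark.
my %data = (
    1 => [ '123', '456', '678' ],
    2 => 'value_2', 3 => 'value_3', 4 => 'value_4', 5 => 'value_5',
    6 => 'value_6', 7 => 'value_7', 8 => 'value_8',
);
my $frozen = freeze( \%data );

my %outer_hash;    # declared outside the subs, like %hash_d

cmpthese( -1, {
    # thaw only; a caller would then use $ref->{'2'} directly
    ref_only     => sub { my $ref = thaw($frozen); },
    # thaw, then copy into a fresh lexical hash on every iteration
    copy_lexical => sub { my %hash = %{ thaw($frozen) }; },
    # thaw, then copy into the hash declared outside the sub
    copy_outer   => sub { %outer_hash = %{ thaw($frozen) }; },
} );

# Sanity check that the thawed structure is usable through the reference.
my $check = thaw($frozen);
print "field 2: $check->{2}\n";
```

    Keeping the three variants side by side shows how much of the measured time belongs to the hash copy and how much to thaw.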

    Hope this helps, José

      Another option is XML::Dumper. It's much slower, even than Storable, but it shares all of the benefits of Storable mentioned by José (and I'll second that the fetch/store will possibly be a bigger bottleneck than the serialize/deserialize), and it also has the advantage of storing in a human-readable format, which is useful for debugging.

      Here are the results with e = XML::Dumper, but with freeze and pl2xml moved into the comparison functions as well (for a round-trip idea, fwiw):

      shift8@2axxon:~$ perl
      Benchmark: running a, b, c, d, e for at least 1 CPU seconds...
               a:  1 wallclock secs ( 1.14 usr +  0.00 sys =  1.14 CPU) @ 25784.21/s (n=29394)
               b:  1 wallclock secs ( 1.09 usr +  0.00 sys =  1.09 CPU) @ 32879.82/s (n=35839)
               c:  1 wallclock secs ( 1.05 usr +  0.01 sys =  1.06 CPU) @ 4610.38/s (n=4887)
               d:  1 wallclock secs ( 1.05 usr +  0.00 sys =  1.05 CPU) @ 4265.71/s (n=4479)
               e:  1 wallclock secs ( 1.04 usr +  0.00 sys =  1.04 CPU) @ 322.12/s (n=335)
            Rate      e     d     c     a     b
      e    322/s     --  -92%  -93%  -99%  -99%
      d   4266/s  1224%    --   -7%  -83%  -87%
      c   4610/s  1331%    8%    --  -82%  -86%
      a  25784/s  7905%  504%  459%    --  -22%
      b  32880/s 10107%  671%  613%   28%    --

      <perldata>
       <hashref memory_address="0x82252f8">
        <item key="1">
         <arrayref memory_address="0x814bc28">
          <item key="0">123</item>
          <item key="1">456</item>
          <item key="2">678</item>
         </arrayref>
        </item>
        <item key="2">value_2</item>
        <item key="3">value_3</item>
        <item key="4">value_4</item>
        <item key="5">value_5</item>
        <item key="6">value_6</item>
        <item key="7">value_7</item>
        <item key="8">value_8</item>
       </hashref>
      </perldata>

      Thanks to all, but especially to you: your explanation helped me understand the output I got, and by giving the $hash_e example you also revealed the obvious point I had missed completely.

      The function I wrote the benchmark for needs to return a hash ref anyway. Shame on me for not thinking about it before :)

      Thx and greetings from Europe.
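
      A retrieval function that hands back the thawed hash reference straight from a DBM file could look roughly like the sketch below. It makes several assumptions: it uses SDBM_File instead of the $DB_HASH format mentioned in the question, because SDBM_File is always bundled with Perl (it limits each key/value pair to roughly 1 KB, so it only suits small records), and the file name, the key 'set_1' and the helper fetch_dataset are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use File::Temp qw(tempdir);
use SDBM_File;
use Storable qw(freeze thaw);

# Temporary directory so the sketch cleans up after itself.
my $dir  = tempdir( CLEANUP => 1 );
my $file = "$dir/datasets";

tie my %db, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
    or die "Cannot open $file: $!";

# Store one frozen dataset under a hypothetical key.
$db{'set_1'} = freeze( { 1 => [ '123', '456', '678' ], 2 => 'value_2' } );

# Hypothetical helper: returns the thawed hash reference, or undef
# when the key is not present. No copy into a second hash is made.
sub fetch_dataset {
    my ( $db, $key ) = @_;
    my $frozen = $db->{$key};
    return defined $frozen ? thaw($frozen) : undef;
}

my $set = fetch_dataset( \%db, 'set_1' );
print "$set->{2} $set->{1}[1]\n";    # prints "value_2 456"

my $missing = fetch_dataset( \%db, 'no_such_key' );

untie %db;
```

      With DB_File the tie call takes $DB_HASH as an extra final argument; the freeze/thaw part stays the same.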


Node Type: perlquestion [id://612265]
Approved by Corion
Front-paged by Corion