PerlMonks  

Re: A more elegant way to filter a nested hash? Benchmarks!

by jimpudar (Pilgrim)
on Jun 02, 2018 at 14:13 UTC [id://1215736]


in reply to A more elegant way to filter a nested hash?

The results are in!

Thank you all very much for your suggestions.

I tried my hand at writing an iterative version, but it didn't perform as well as I'd hoped. Can anyone suggest a better implementation?

Not surprisingly, the version which just deletes keys from the source hash is by far the fastest.

More surprising to me was that the recursive version is quite a lot faster than both the map/grep and slice based solutions.

I didn't spend much time trying to get mr_ron's version working. I may take a closer look at that one later this weekend.

Let me know your thoughts!

#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use Test::More;
use Benchmark qw(timethese cmpthese);
use Deep::Hash::Utils qw(reach nest deepvalue);

my $t = {
    source => {
        f1 => 'garbage',
        f2 => 'more garbage',
        f3 => 'important data',
        f4 => { this => 'sub hash', is => 'garbage' },
        f5 => {
            f6 => 'more important data',
            f7 => {
                more => 'garbage',
                f8   => 'important data',
            },
            f9 => 'garbage',
        },
        f10 => [ 'important', 'data' ],
        f11 => [ 'more', 'garbage' ]
    },
    filter => {
        f3  => 1,
        f5  => { f6 => 1, f7 => { f8 => 1 } },
        f10 => 1
    },
    output => {
        f3 => 'important data',
        f5 => {
            f6 => 'more important data',
            f7 => { f8 => 'important data', }
        },
        f10 => [ 'important', 'data' ],
    },
};

sub hash_filter_recursive {
    my $source = shift;
    my $filter = shift;

    my %output;

    foreach ( keys %$filter ) {
        if ( exists $source->{$_} ) {
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $source->{$_} eq 'HASH' );
                $output{$_} =
                    hash_filter_recursive( $source->{$_}, $filter->{$_} );
            }
            else {
                $output{$_} = $source->{$_};
            }
        }
    }

    return \%output;
}

sub hash_filter_iterative {

    # I expected this one to be faster...
    # Can anyone suggest a better implementation?

    my $source = shift;
    my $filter = shift;

    my $output = {};
    my @queue  = ( [ $source, $filter, $output ] );

    while ( my $a = shift @queue ) {
        my ( $s, $f, $o ) = @{$a};

        foreach ( keys %$f ) {
            if ( exists $s->{$_} ) {
                if ( ref $f->{$_} eq 'HASH' ) {
                    croak "bad filter: on '$_', expected HASH\n"
                        unless ( ref $s->{$_} eq 'HASH' );
                    $o->{$_} = {};
                    push @queue, [ $s->{$_}, $f->{$_}, $o->{$_} ];
                }
                else {
                    $o->{$_} = $s->{$_};
                }
            }
        }
    }

    return $output;
}

sub hash_filter_delete {
    my $source = shift;
    my $filter = shift;

    foreach ( keys %$source ) {
        if ( exists $filter->{$_} ) {
            if ( ref $filter->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $source->{$_} eq 'HASH' );
                hash_filter_delete( $source->{$_}, $filter->{$_} );
            }
        }
        else {
            delete $source->{$_};
        }
    }

    return $source;
}

sub hash_slice {

    # contributed by Veltro

    my $s = $_[0];
    my $f = $_[1];
    my $n = {};
    my @keysToGet = keys %{$f};
    @{$n}{@keysToGet} = @{$s}{@keysToGet};
    foreach ( keys %{$n} ) {
        if ( ref $n->{$_} eq 'HASH' ) {
            $n->{$_} = hash_slice( $s->{$_}, $f->{$_} );
        }
    }
    return $n;
}

sub map_grep {

    # contributed by shmem

    my $source = shift;
    my $filter = shift;

    return {
        map {
            ref $filter->{$_} eq 'HASH'
                ? ref $source->{$_} eq 'HASH'
                    ? ( $_, map_grep( $source->{$_}, $filter->{$_} ) )
                    : croak "bad filter: on '$_', expected HASH\n"
                : ( $_, $source->{$_} )
        } grep { exists $source->{$_} } keys %$filter
    };
}

sub map_grep_2 {

    # contributed by shmem

    return {
        map {
            ref $_[1]->{$_} eq 'HASH'
                ? ref $_[0]->{$_} eq 'HASH'
                    ? ( $_, map_grep_2( $_[0]->{$_}, $_[1]->{$_} ) )
                    : croak "bad filter: on '$_', expected HASH\n"
                : ( $_, $_[0]->{$_} )
        } grep { exists $_[0]->{$_} } keys %{ $_[1] }
    };
}

sub hash_deep_utils {

    # contributed by mr_ron

    my ( $source, $filter ) = @_;
    my %rc;
    while ( my @l = reach($filter) ) {
        pop @l;
        if ( defined( my $source_val = deepvalue( $source, @l ) ) ) {

            # hint: work around nest behavior on even vs odd key count
            nest( \%rc, @l )->{ $l[$#l] } = $source_val;
        }
    }
    \%rc;
}

is_deeply( $_->( $t->{source}, $t->{filter} ), $t->{output} )
    foreach (
        \&hash_filter_recursive, \&hash_filter_iterative,
        \&hash_filter_delete,    \&hash_slice,
        \&map_grep,              \&map_grep_2,
        \&hash_deep_utils
    );

done_testing();

cmpthese(
    1000000,
    {
        recursive  => sub { hash_filter_recursive( $t->{source}, $t->{filter} ) },
        iterative  => sub { hash_filter_iterative( $t->{source}, $t->{filter} ) },
        delete     => sub { hash_filter_delete( $t->{source}, $t->{filter} ) },
        map_grep   => sub { map_grep( $t->{source}, $t->{filter} ) },
        map_grep_2 => sub { map_grep_2( $t->{source}, $t->{filter} ) },
        slice      => sub { hash_slice( $t->{source}, $t->{filter} ) },
    }
);

ok 1
ok 2
ok 3
ok 4
ok 5
ok 6
not ok 7
#   Failed test at ./benchmark.pl line 209.
#     Structures begin differing at:
#          $got->{f5}{f7} = 'f8'
#     $expected->{f5}{f7} = HASH(0x7fd932364b88)
1..7
               Rate iterative slice map_grep map_grep_2 recursive delete
iterative  191571/s        --   -3%     -20%       -23%      -30%   -57%
slice      198413/s        4%    --     -17%       -20%      -27%   -56%
map_grep   240385/s       25%   21%       --        -3%      -12%   -46%
map_grep_2 248139/s       30%   25%       3%         --       -9%   -45%
recursive  271739/s       42%   37%      13%        10%        --   -39%
delete     448430/s      134%  126%      87%        81%       65%     --
# Looks like you failed 1 test of 7.

Best,

Jim

πάντων χρημάτων μέτρον έστιν άνθρωπος.

Replies are listed 'Best First'.
Re^2: A more elegant way to filter a nested hash? Benchmarks!
by vr (Curate) on Jun 04, 2018 at 19:04 UTC
    Not surprisingly, the version which just deletes keys from the source hash is by far the fastest.

No surprise here, because keys were deleted only once, during a single test run. In all the other 1e6 runs, the "else" branch was never reached. I'm not suggesting the benchmark be fixed, e.g. by deep-cloning $t for each run for every contestant, because frankly I don't see the point. The OP uses the word "remove" twice, but apart from this hash_filter_delete in the benchmarks, nothing in the whole thread was removed; filtered shallow copies were constructed instead. Was that the intention? There is some inconsistency, at least, in that strings are duplicated but arrays are shared by reference. The useful payload is so tiny compared to the overhead of e.g. maintaining a queue (pushing, shifting, creating anonymous arrays), and the recursion depth so shallow, that the iterative approach will show no advantage.
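For illustration, the deep-cloning idea would look something like this minimal sketch, assuming Storable's dclone for the copy (filter_delete here is a toy stand-in for the thread's hash_filter_delete, not the original code):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Storable qw(dclone);

# Toy destructive filter in the spirit of hash_filter_delete:
# removes every top-level key not present in the filter.
sub filter_delete {
    my ( $source, $filter ) = @_;
    delete @{$source}{ grep { !exists $filter->{$_} } keys %$source };
    return $source;
}

my %pristine = ( keep => 'data', junk => 'garbage' );
my %filter   = ( keep => 1 );

# dclone hands each iteration a fresh deep copy, so the delete
# branch does real work every time instead of only on the first run.
for ( 1 .. 3 ) {
    my $copy = dclone( \%pristine );
    filter_delete( $copy, \%filter );
}

print scalar( keys %pristine ), "\n";    # the source still has both keys
```

Inside cmpthese, the dclone call would go into each contestant's timed sub, so everyone pays the same copying overhead.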

      Ah, I can't believe I missed that the keys were only being deleted once!

      hash_filter_delete is indeed the only function which modifies the source hash instead of constructing a new one. See this node for the inspiration.

      As for the tiny payload, the data I am actually using this hash_filter function on is of similar size and depth. I could try out some larger payloads with more levels of hash nesting to see if the iterative version does any better.
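      For example, a deeper payload could be generated along these lines (a sketch; build_hash is a hypothetical helper, not code from this thread):

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical payload generator: a hash of $width keys per level,
# nested $depth levels deep, with scalar values at the leaves.
sub build_hash {
    my ( $depth, $width ) = @_;
    return {
        map {
            "k$_" => $depth > 1
                ? build_hash( $depth - 1, $width )
                : "value $_"
        } 1 .. $width
    };
}

my $big = build_hash( 6, 8 );        # 8**6 = 262144 leaf values
print scalar( keys %$big ), "\n";    # 8 top-level keys
```

      A matching filter could be built the same way, keeping only a subset of the keys at each level, which would exercise the queue overhead at real depth.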

      Thanks,

      Jim

      πάντων χρημάτων μέτρον έστιν άνθρωπος.

Re^2: A more elegant way to filter a nested hash? Benchmarks!
by Veltro (Hermit) on Jun 06, 2018 at 21:49 UTC

    Hi jimpudar, thanks for posting those results. Although the tests are probably fine for the original problem you posted, I had the same concerns as vr. I think the payload is too small for testing, so I started to do some testing myself.

    Just for the fun of it, I created some tests that use much bigger hash structures in memory (my aim was memory sizes of 1.2 to 1.8 Gb of data). I also wanted some variance in how the hash is created, e.g. hashes with lots of values and fewer hashes vs. hashes with lots of hashes containing fewer values, and small filters vs. big filters.

    Here are the results of my tests:

    1. Hash depth = 10. Lots of values vs. lots of hashes. Generated in loops that run 17 times. ~300k hashes. Filter keeps 88% of hashes and 70% of values of the original hash. 10 test cycles. Memory usage approx. 1.4 Gb.

    Number of Hashes = 305798
    Number of Values = 4892785
    Number of Hashes Filtered = 268718
    Number of Values Filtered = 3435051
    Filtering 88% Hashes
    Filtering 70% Values
               s/iter map_grep_2 map_grep slice recursive iterative
    map_grep_2   6.29         --      -1%   -8%      -10%      -16%
    map_grep     6.22         1%       --   -7%       -9%      -15%
    slice        5.81         8%       7%    --       -2%       -9%
    recursive    5.68        11%      10%    2%        --       -7%
    iterative    5.27        19%      18%   10%        8%        --

    iterative seems to perform very well under these circumstances. I ran this test multiple times, occasionally recursive or slice wins

    2. Hash depth = 10. Lots of values vs. lots of hashes. Generated in loops that run 17 times. ~400k hashes. Filter keeps 50% of hashes and 10% of values of the original hash. 10 test cycles. Memory usage approx. 1.7 Gb.

    Number of Hashes = 405929
    Number of Values = 6494881
    Number of Hashes Filtered = 203883
    Number of Values Filtered = 660915
    Filtering 50% Hashes
    Filtering 10% Values
               s/iter iterative map_grep map_grep_2 slice recursive
    iterative    1.67        --      -1%        -1%   -7%      -12%
    map_grep     1.65        1%       --        -0%   -6%      -11%
    map_grep_2   1.65        1%       0%         --   -6%      -10%
    slice        1.55        7%       6%         6%    --       -5%
    recursive    1.48       13%      12%        12%    5%        --

    3. Hash depth = 10. Lots of hashes vs. lots of values. Generated in loops that run 5 times. ~1.6M hashes. Filter keeps 87% of hashes and 69% of values of the original hash. 10 test cycles. Memory usage approx. 2.4 Gb.

    Number of Hashes = 1579338
    Number of Values = 6317357
    Number of Hashes Filtered = 1369612
    Number of Values Filtered = 4362177
    Filtering 87% Hashes
    Filtering 69% Values
               s/iter map_grep_2 slice map_grep recursive iterative
    map_grep_2   10.7         --   -0%      -3%       -6%       -6%
    slice        10.7         0%    --      -3%       -6%       -6%
    map_grep     10.4         3%    3%       --       -3%       -3%
    recursive    10.1         6%    6%       3%        --       -0%
    iterative    10.0         7%    7%       3%        0%        --

    4. Hash depth = 10. Lots of hashes vs. lots of values. Generated in loops that run 5 times. ~1.2M hashes. Filter keeps 40% of hashes and 10% of values of the original hash. 10 test cycles. Memory usage approx. 1.5 Gb.

    Number of Hashes = 1226812
    Number of Values = 4907253
    Number of Hashes Filtered = 492351
    Number of Values Filtered = 509410
    Filtering 40% Hashes
    Filtering 10% Values
               s/iter iterative slice map_grep map_grep_2 recursive
    iterative    2.56        --  -16%     -17%       -17%      -22%
    slice        2.15       19%    --      -1%        -1%       -7%
    map_grep     2.13       20%    1%       --        -0%       -6%
    map_grep_2   2.12       20%    1%       0%         --       -6%
    recursive    1.99       28%    8%       7%         6%        --

    The last test shows recursive as fastest; however, tests 2 and 4 are, as far as I am concerned, less interesting, since they run for a shorter time and only show differences of a couple of tenths of a second.

    After testing I think that the recursive function is still a good solution, although my tests show it is actually iterative that won in the most complex schemes. Some of the algorithms run slower or faster under different circumstances: I generated different hash structures and filter structures, and each of these differences may be more or less beneficial to the different algorithms. This I find important, because as far as I am concerned, choosing the best algorithm should always be based on where the biggest time loss occurs.

    In the end I have to ask: what are we actually optimizing? Readable code vs. a win of one or two seconds when processing gigabytes of data?

    Sorry for not adding mr_ron's tests; I couldn't get them to work.

    I did optimize the slice method, though: it had an extra keys call that was not needed, and I also added the check for a valid filter:

    sub hash_slice {
        my $n;
        my @keysToGet = keys %{ $_[1] };
        @{$n}{@keysToGet} = @{ $_[0] }{@keysToGet};
        foreach (@keysToGet) {
            if ( ref $_[1]->{$_} eq 'HASH' ) {
                croak "bad filter: on '$_', expected HASH\n"
                    unless ( ref $_[0]->{$_} eq 'HASH' );
                $n->{$_} = hash_slice( $_[0]->{$_}, $_[1]->{$_} );
            }
        }
        return $n;
    }
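    To see the added check fire, here is a minimal sketch (the sub is repeated from above only so the snippet runs standalone; the bad filter is a made-up example):

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;

sub hash_slice {
    my $n;
    my @keysToGet = keys %{ $_[1] };
    @{$n}{@keysToGet} = @{ $_[0] }{@keysToGet};
    foreach (@keysToGet) {
        if ( ref $_[1]->{$_} eq 'HASH' ) {
            croak "bad filter: on '$_', expected HASH\n"
                unless ( ref $_[0]->{$_} eq 'HASH' );
            $n->{$_} = hash_slice( $_[0]->{$_}, $_[1]->{$_} );
        }
    }
    return $n;
}

# A filter that expects a nested hash where the source has a scalar:
my $source = { f1 => 'scalar here' };
my $filter = { f1 => { f2 => 1 } };

my $result = eval { hash_slice( $source, $filter ) };
print $@;    # bad filter: on 'f1', expected HASH
```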

    Here's the code that I used:

    Note: I posted on June 6th, but then found out that the code was not actually working correctly, so I had to fix the code and update the benchmark tests and conclusions on the 7th. The code is the last version I used.

      Very nice, thanks for putting this together!

      I think we have shown that 99% of the time, the recursive solution is best, as it is the easiest to understand and maintain.

      Best,

      Jim

      πάντων χρημάτων μέτρον έστιν άνθρωπος.
