Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask
 
PerlMonks  

A fix for merge keys in YAML::XS, YAML::Syck

by bv (Friar)
on Dec 18, 2009 at 21:52 UTC ( #813443=perlmeditation: print w/replies, xml ) Need Help??

Ok, so not a fix, but a workaround. The problem is fairly well described in this blog post and this bug report for YAML::XS. The merge keys functionality of YAML is not supported by either YAML::XS or YAML::Syck, so this script:

#!/usr/bin/perl use strict; use warnings; use YAML::XS; use Data::Dumper; my $data; {local $/;$data = <DATA>}; print Dumper Load $data; __DATA__ --- key1: &id01 subkey_a: asdf subkey_b: qwer key2: <<: *id01 subkey_a: foo subkey_c: bar ...

produces this output:

$VAR1 = { 'key2' => { 'subkey_a' => 'foo', 'subkey_c' => 'bar', '<<' => { 'subkey_b' => 'qwer', 'subkey_a' => 'asdf' } }, 'key1' => $VAR1->{'key2'}{'<<'} };

The expected functionality is for the keys of key1 to be merged into the keys of key2, with key2 taking precedence. Not having the time or resources to dig into the XS guts of either module, my solution was a simple recursion function to traverse the data structure, merging the '<<' references with their containing objects:

#!/usr/bin/perl use strict; use warnings; use YAML::XS; use Data::Dumper; my $data; {local $/;$data = <DATA>}; print Dumper mergekeys( Load $data ); sub mergekeys { my ($ref) = @_; my $type = ref $ref; if ($type eq 'HASH') { my $tmphref = $ref->{'<<'}; if ($tmphref) { die "Merge key does not support merging non-hashmaps" unless (ref $tmphref eq 'HASH'); my %tmphash = %$tmphref; delete $ref->{'<<'}; %$ref = (%tmphash, %$ref); } mergekeys($_) for (values %$ref); } elsif ($type eq 'ARRAY') { mergekeys($_) for (@$ref); } return $ref; }

This produces the much more accurate data structure:

$VAR1 = { 'key2' => { 'subkey_b' => 'qwer', 'subkey_a' => 'foo', 'subkey_c' => 'bar' }, 'key1' => { 'subkey_b' => 'qwer', 'subkey_a' => 'asdf' } };

Obviously, this is not a very robust solution. It makes no checks for cyclical data, potentially expands the data structure due to dereferencing some addresses, and its recursive nature means it would rapidly consume resources when used on very deep data structures. But for the needs that I had, it worked pretty well. Hopefully the maintainers of YAML::XS and YAML::Syck can provide this functionality soon, since it seems that their respective underlying C libraries support merge keys just fine.

UPDATE: Found a couple easy ways to avoid recursion. See my reply below.


@_=qw; Just another Perl hacker,; ;$_=q=print "@_"= and eval;

Replies are listed 'Best First'.
Re: A fix for merge keys in YAML::XS, YAML::Syck
by bv (Friar) on Dec 21, 2009 at 16:34 UTC

    Well, I spent some more time thinking about the problem, and I came up with a slightly better algorithm. Instead of a recursive, depth-first approach, I used an iterative, breadth-first algorithm that I arrived at while trying to optimize my previous attempt into a tail-recursive scheme. Code and benchmarks follow.

    And the results of the benchmark:

    Rate recursive tail_call loop recursive 951/s -- -6% -14% tail_call 1016/s 7% -- -9% loop 1111/s 17% 9% --

    I was surprised that the loop was that much faster than the tail-recursion, since it amounts to about the same thing (go to top of loop, check condition, execute or return). It is important to note, though, that I was only able to replace the goto with a while loop because my function was intended to be run for its side-effects on the data structure. If I needed to accumulate a return value, the goto would have been the better solution.


    print map{(split//,'hark, suPerJacent other l')[$_]}(11,7,6,16,5,1,15,18..23,8..10,24,17,0,12,13,3,14,2,4);
      The original version may fail to merge keys if the merged hash is a sibling of the merging hash and itself merges a sibling. Success or failure then depends on the order in which hash values are returned by keys()!
      key1: &key1 a: a b: b c: c key2: &key2 <<: *key1 d: d e: e f: f key3: <<: *key2 g: g h: h i: i ...
      If key2 is pushed onto @_ before key3 all is well, but if not there will still be a merge key pointing at key1 after key2 has been merged into key3. Replacing the check for the existence of a mergekey with a while loop fixes that, because then a new merge will be performed and the outer while loop traversing @_ will not move on while there is still/again a merge key.
      sub mergekeys_loop { my ( $orig ) = @_; while ( my $ref = shift ) { my $type = ref $ref; if ( $type eq 'HASH' ) { # my $tmphref = $ref->{'<<'}; # if ( $tmphref ) { while ( my $tmphref = $ref->{'<<'} ) { die "Merge key does not support merging non-hashmaps" unless ( ref $tmphref eq 'HASH' ); my %tmphash = %$tmphref; delete $ref->{'<<'}; %$ref = ( %tmphash, %$ref ); } push @_, grep { ref eq 'HASH' or ref eq 'ARRAY' } values % +$ref; } elsif ( $type eq 'ARRAY' ) { push @_, grep { ref eq 'HASH' or ref eq 'ARRAY' } @$ref; } } return $orig; }
        I wrote:

        The original version may fail to merge keys if the merged hash is a sibling of the merging hash and itself merges a sibling. Success or failure then depends on the order in which hash values are returned by keys()!

        That should have read:

        The original version of mergekeys_loop() may fail to merge keys if the merged hash is a sibling of the merging hash and itself merges a sibling. Success or failure then depends on the order in which hash values are returned by values()!

        Sorry!

Re: A fix for merge keys in YAML::XS, YAML::Syck
by vtsj (Initiate) on Jan 21, 2013 at 00:49 UTC

    Ran into the same problem of '<<' not being supported in YAML::XS today (in 2013 no less, for shame). Many thanks for your effort, this post proved helpful.

    I extended your recursive solution to detect cycles, and also to allow for multiple hashes to be merged (in left to right order). The latter is useful for my intended purposes, but is apparently also supported by Ruby's YAML parser, so I figured it would make a nice addition. Ideally this work would be done to YAML::XS, but like you I have the time nor resources. I'm by no means a Perl programmer, I merely tinker, so please forgive any inadequacies.

    From your benchmarks it appears that your breadth-first loop solution is the faster method, but the cycle detection relies on the recursive nature of the implementation to be able to detect that it has re-encountered a hash- or arrayref that it has already previously encountered while resolving an inheritance chain. It uses an a fairly blunt method of keeping track of these: by maintaining a stack of the encountered hash/arrayrefs throughout the recursion calls. I've made no effort towards optimizing this, nor the particular way of searching the stack -- if anybody can find better ways, I'd love to hear about them. Of course, the remarks you've made about the initial recursive implementation also remain valid.

    For the cyclic references, I've chosen to keep them in the output. I.e., you would be able to infinitely traverse the data structure via the cycle. Alternatively, you can also error out and leave it up to the user to not have cycles in their YAML files. Ruby's implementation apparently simply removes the cycle and keeps on trucking, so I figured I might as well do the same (at least for the keep trucking part).

    #!/usr/bin/perl use strict; use warnings; use YAML::XS; use Data::Dumper; my $data; {local $/;$data = <>}; sub mergekeys { return _mergekeys($_[0], []); } sub _mergekeys { my $ref = shift; my $resolveStack = shift; my $reftype = ref $ref; # If this hash or array is already on the resolution stack, then s +omewhere, a child data structure is trying to inherit from one of its + parents, # and hence by extension trying to inherit itself. if ($reftype =~ /HASH|ARRAY/ and (grep $_ == $ref, @$resolveStack) + > 0) { # Halt and catch fire, or store the cyclic reference and not p +rocess it further. Not complaining seems to # be the behaviour of ruby's YAML parser, so let's go for that +. #die "Cyclic inheritance detected: ".($ref)." is already on th +e resolution stack!\n" . # "Dump of cyclic data structure (may have inheritance alre +ady partially resolved):\n".Dumper($ref); return $ref; } if (ref($ref) eq 'HASH') { push @$resolveStack, $ref; if (exists $ref->{'<<'}) { my $inherits = $ref->{'<<'}; # can be either a single href +, or an array of hrefs die "Undefined value for merge key '<<' in ".Dumper($ref) +unless defined $inherits; # catch edge cases that YAML::XS won't catc +h, like "<<: &foo" die "Merge key does not support merging non-hashmaps" unle +ss ref($inherits) =~ /HASH|ARRAY/; $inherits = [$inherits] if ref($inherits) eq 'HASH'; # nor +malize for further processing # For each of the hashes/arrays we're inheriting, have the +m resolve their inheritance first before applying them onto ourselves +. # Also, remove the '<<' reference only afterwards, since b +y recursion these will have already been removed from our inheritees, + + # and this also allows us to show the c +yclic reference by dumping out the structure when we detect one. foreach my $inherit (@$inherits) { $inherit = _mergekeys($inherit, $resolveStack); %$ref = (%$inherit, %$ref); } delete $ref->{'<<'}; } _mergekeys($_, $resolveStack) for (values %$ref); die "Fatal error: imbalanced recursion stack in _mergekeys. Th +is likely implies a programming error and/or a YAML file from hell." +unless pop(@$resolveStack) eq $ref; } elsif (ref($ref) eq 'ARRAY') { push @$resolveStack, $ref; _mergekeys($_, $resolveStack) for (@$ref); die "Fatal error: imbalanced recursion stack in _mergekeys. Th +is likely implies a programming error and/or a YAML file from hell." +unless pop(@$resolveStack) eq $ref; } return $ref; } my $yml = Load($data); mergekeys $yml; print Dumper $yml;
    Here's the most convoluted test file I came up with:
    --- key0: &id00 zero: 0 key1: &id01 <<: *id00 subkey_a: asdf subkey_b: qwer Z: "key1.Z" key2: &id02 everyday: apple X: 1 Y: 2 Z: "key2.Z" A: - <<: *id00 zero: "00" - <<: *id01 - <<: *id02 #cycle! everyday: shuffling Y: 3000 <<: *id00 key3: &id03 <<: [*id01, *id02, *id01] subkey_a: foo subkey_c: bar deeper: challenger: deep <<: *id00
    And the output it produces:
    $ perl merge_keys.pl cyclic_test2.yml $VAR1 = { 'key2' => { 'everyday' => 'apple', 'Z' => 'key2.Z', 'A' => [ { 'zero' => '00' }, { 'Z' => 'key1.Z', 'subkey_b' => 'qwer', 'subkey_a' => 'asdf', 'zero' => 0 }, { 'A' => $VAR1->{'key2'}{'A'}, 'Z' => 'key2.Z', 'everyday' => 'shuffling', 'X' => 1, 'zero' => 0, 'Y' => 3000 } ], 'X' => 1, 'zero' => 0, 'Y' => 2 }, 'key1' => { 'subkey_b' => 'qwer', 'Z' => 'key1.Z', 'subkey_a' => 'asdf', 'zero' => 0 }, 'key0' => { 'zero' => 0 }, 'key3' => { 'A' => $VAR1->{'key2'}{'A'}, 'subkey_b' => 'qwer', 'subkey_a' => 'foo', 'X' => 1, 'Y' => 2, 'everyday' => 'apple', 'Z' => 'key1.Z', 'subkey_c' => 'bar', 'zero' => 0, 'deeper' => { 'challenger' => 'deep', 'zero' => 0 } } };

    Notice that the cyclic references are maintained, and the values of "everyday" and "Y" are correctly overridden by their local values in key2.A[2] in the original YML file.

    Note also that key3's final value for Z is the one inherited from key1, due to *id01 appearing a second time after the *id02 in key3's inheritance list.

    Compare and contrast to ruby's result:

    $ ruby -ryaml -rpp -e 'x = YAML.load_file("cyclic_test2.yml"); print P +P.pp(x);' {"key0"=>{"zero"=>0}, "key1"=>{"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key +1.Z"}, "key2"=> {"everyday"=>"apple", "X"=>1, "Y"=>2, "Z"=>"key2.Z", "A"=> [{"zero"=>"00"}, {"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key1.Z" +}, {"everyday"=>"shuffling", "X"=>1, "Y"=>3000, "Z"=>"key2.Z"}], "zero"=>0}, "key3"=> {"zero"=>0, "subkey_a"=>"foo", "subkey_b"=>"qwer", "Z"=>"key1.Z", "everyday"=>"apple", "X"=>1, "Y"=>2, "A"=> [{"zero"=>"00"}, {"zero"=>0, "subkey_a"=>"asdf", "subkey_b"=>"qwer", "Z"=>"key1.Z" +}, {"everyday"=>"shuffling", "X"=>1, "Y"=>3000, "Z"=>"key2.Z"}], "subkey_c"=>"bar", "deeper"=>{"challenger"=>"deep", "zero"=>0}}}

    Notice that Ruby has (must have) performed the cyclic merge at least partially, as evidenced by the presence of the mapping "X"=>1 in the resulting hash for key2.A[2]. However, the mapping "zero"=>0 is missing from this value, which is present in Perl's result. Indeed, key2 inherits from id00, which imports the mapping "zero"=>0.

    This difference may be due to the order in which merges are performed. Since the cyclic merge appears in the document before the key2's <<: *id00 does, maybe Ruby had not yet processed that merge at the time of doing the cyclic merge. I'm just guessing here, and I don't know which behaviour is correct, but at least you know about the difference.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://813443]
Approved by toolic
Front-paged by planetscape
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others browsing the Monastery: (6)
As of 2019-12-14 15:39 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?