Efficient partial deep cloning

Ovid has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to create a replacement for AI::Prolog (one that uses Perl data structures and doesn't require knowledge of Prolog) and I've hit a stumbling block and my solution seems really nasty.

Imagine an arbitrarily complex data structure:

  [
    $var1,
    undef,
    [
      undef,
      undef,
      { 'foo' => $var2 }
    ],
    {
      'bar' => $var3
    }
  ];
[download]

Let's say I have an arbitrary number of other data structures with identical structures, but different values. For every variable in the structure above (it's an instance of something called a "logic" variable), how can I efficiently assign it the corresponding value in the each of the other structures? As a simple example, let's say that structure looks like this:

  my $unknown = [ undef, undef, $var ];
[download]

If I have three other structures like this:

  my @structures = ( 
    [ qw/ 1 2 3 / ],
    [ qw/ this that other / ],
    [ qw/ un deux trois / ],
  );
[download]

Then I'd like something like this:

  foreach my $structure (@structures) {
    bind($unknown, $structure);
    print $var->value;
  }
[download]

That would print "3", "other", and "trois", in sequence.

As mentioned, the data structures might be arbitrarily complex and currently the only data types allowed are scalars, array refs, hashrefs, and regexes and due to how the system is built, the structure are guaranteed to be identical in shape, including size of arrays and hashes.

I am assuming that I need an easy way to spider the $unknown structure and cache the positions of all of those logic variables. The only way I've thought of it so far is to build up a string representation of the data structure and use nasty string eval tricks to get at them (sort of like XPath for Perl data structures). For example, the following will assign "3" to the logic variable:

  my $struct = [
    'whaa!',
    undef,
    [undef, undef, { foo => 3}],
    { bar => 'baz'},
  ];
  my $path = "[2][2]{foo}";
  $var->bind( eval "\$struct->$path" );
[download]

This seems really nasty and any advice would be welcome.

Note that things like Data::Diver and Class::XPath seem great, but they don't solve the general algorithm problem of:

  foreach my $path (find_logic_var($unknown_struct)) {
      push @paths => $path;
  }
[download]

In other words, the paths are unknown up front and that's the major problem I want to solve efficiently. Maybe I should recursively walk the data structure while pushing path elements on a localized stack?

Cheers,
Ovid

New address of my CGI Course.

Comment on Efficient partial deep cloning Select or Download Code

Replies are listed 'Best First'.
Re: Efficient partial deep cloning by jbert (Priest) on Jan 08, 2008 at 12:06 UTC
Walking the data structure and finding the elements is one way to do it (can there be more than one such elt? Do you need to worry about cycles in the structure?). But it seems to me that these data structures have meta-information which you aren't currently storing. If you want efficiency you should store that information rather than recalculate it via a search. One way to do this would be to wrap accesses to the data structure so that when a logic var is added to it, it's position is remembered (perhaps in some kind of inside-out object attribute stash). I'd do it this way, with a class around the data structure (add a "addLogicVar" method) although you could use TIE I guess.	[reply]
Re^2: Efficient partial deep cloning by Ovid (Cardinal) on Jan 08, 2008 at 12:09 UTC
I forgot about cycles, so yes, I need to worry about those. Damn. Cheers, Ovid New address of my CGI Course.	[reply]
Re^3: Efficient partial deep cloning by exussum0 (Vicar) on Jan 08, 2008 at 15:32 UTC
Can't you use something like the reference of the current "thing" and the "thing" to be considered to be copied as a key in a hashset? Everytime you encounter those two things in that relationsip, process it once and only once? Or.. create an algorithm to do a breadth first search (can be less memory intense) to discover all edges and copy them in a graph?	[reply]
Re^4: Efficient partial deep cloning by Ovid (Cardinal) on Jan 09, 2008 at 10:29 UTC
Re: Efficient partial deep cloning by Ovid (Cardinal) on Jan 08, 2008 at 12:04 UTC
FYI: I do have a sample implementation which works. From the tests: $data = { foo => 3, bar => [ qw/ this that 3 / ], 3 => undef, baz => { trois => 3, quatre => [ qw/ 1 2 3 4 / ], } }; ok $paths = path( $data, 3 ), 'Fetching paths for matching data should succeed'; my @expected = sort ( '{bar}[2]', '{baz}{trois}', '{baz}{quatre}[2]', '{foo}' ); $paths = [ sort @$paths ]; is_deeply $paths, \@expected, '... and we should be able to match complex data structures'; foreach my $path (@$paths) { is eval "\$data->$path", 3, '... and each element should have the correct value'; } [download] And the some rough code: use Scalar::Util 'reftype'; my %path = ( ARRAY => \&_array_path, HASH => \&_hash_path, ); sub path { my ( $data, $search ) = @_; my $type = reftype $data; my $find_paths = $path{$type}; return $find_paths->( $data, $search ); } sub _array_path { my ( $data, $search ) = @_; my @paths; foreach my $i (0 .. $#$data) { my $item = $data->[$i]; my $type = reftype $item; my $current_index = "[$i]"; if ( !$type && $item eq $search ) { # XXX push @paths => $current_index; } elsif ( my $find_paths = $path{$type} ) { my @current_paths = map { "$current_index$_" } @{ $find_paths->( $item, $search ) }; push @paths => @current_paths; } } return \@paths; } sub _hash_path { my ( $data, $search ) = @_; my @paths; foreach my $key (keys %$data) { my $item = $data->{$key}; my $type = reftype $item; my $current_key = "{$key}"; if ( !$type && $item eq $search ) { # XXX push @paths => $current_key; } elsif ( my $find_paths = $path{$type} ) { my @current_paths = map { "$current_key$_" } @{ $find_paths->( $item, $search ) }; push @paths => @current_paths; } } return \@paths; } [download] The code itself works fine, but the tests show that using it (the string eval) seems really clumsy. It works fine for what I need, but still looking for a better approach (cleaner and faster). Cheers, Ovid New address of my CGI Course.	[reply] [d/l] [select]
Re: Efficient partial deep cloning by runrig (Abbot) on Jan 09, 2008 at 01:10 UTC
I don't know how this compares against a string eval (probably depends on how deep the structure is), but if the path is stored as something like `qw(A 2 H foo H bar A 1)`, then you could traverse down a path like so (untested, unthought out rough idea): `my @path = qw( A 2 H foo H bar A 1 ); my $data = [ "foo", "bar", { foo => { bar => [ 5, 6, ], }, }, ]; while ( @path ) { my $type = shift @path; my $index = shift @path; if ( $type eq 'A' ) { $data = $data->[$index]; } else { $data = $data->{$index}; } } print $data; # prints "6" (I hope)` [download] Of course, if you want to assign to something in the path, you'd have to traverse the path up until the last item, then assign, etc. Ehh, I'm not sure if or when this might be more efficient or any better than the eval method, but there it is :-)	[reply] [d/l] [select]
Re: Efficient partial deep cloning by sfink (Deacon) on Jan 09, 2008 at 07:14 UTC
Can you "flatten" each of the data structures? By that, I mean store everything in two ways simultaneously: in the actual structure, and also in a separate table that maps string representations of the paths to the appropriate value or variable in the structure. Or perhaps a reference to the actual value. Or separate the flattened representation into data and variables -- then, when you bind($s1,$s2), you'd scan through the keys to the table of variables, and look up each one in the flattened representation of the other value. It would be a simple string lookup in a hashtable at that point. When to generate or update the flattened representation would depend on how these structures are populated. If you were to generate all of them at once, then yes, recursively walk the data structure. I don't think localizing the stack helps, though -- I think that would require copying the whole stack every time you localized it with the old value plus one new path token. Better to just push the token on and pop it back off after you're done. Actually, if you're doing it my way (with a string representation of the path), then you would just call bind(..., $path . $token) recursively and it will manage your stack for you. And if you're maintaining the flattened version at all times, you just need to have the path so far available anywhere you're mutating things. Why are you worried about this particular optimization already? It sounds like you aren't done with the rest of the algorithm. You might want to make it work first, just in case this optimization turns out to be invalid. Can variables be non-leaf nodes in your structures? If so, then you may have to do full searches rather than simple cached path lookups. Can the same variable appear more than once? If so, then you have to do speculative bindings or loop through all possibilities or something. Can the right-hand side also contain variables? Then you need to be able to bind the two variables together, and if they can appear more than once, I think the speculation gets hairier. By the way, I think your function bind() should be called unify(). The terminology I am familiar with is that you unify the two structures, resulting in a set of bindings (possibly recorded via calls to bind()). But perhaps you're doing things differently from what I remember; I can't figure out when you would ever want to unify a single structure with multiple other structures in succession. Then again, it's been a very long time since I last thought about this stuff.	[reply]
Re^2: Efficient partial deep cloning by Ovid (Cardinal) on Jan 09, 2008 at 10:22 UTC
Unification and data binding, in this context, are different things. Prolog: `gives(tom, book, SOMEONE) :- person(SOMEONE), likes(tom, SOMEONE).` [download] Let's look at the `like( tom, SOMEONE )` bit. Prolog would fetch all of the `like/2` facts and attempt to unify those facts, one at a time, with `like( tom, SOMEONE )`. As soon as a fact matches (unifies), then the `SOMEONE` variable will be bound to the corresponding value. So if we have these facts: `likes( mary, tom ). likes( tom, susan ).` [download] The first fact won't unify and `SOMEONE` won't be bound to `tom`. When the second fact is encountered, unification occurs and `SOMEONE` is bound to the value `susan`. Once that's bound, then the engine goes to `person(SOMEONE)` and since the variable is already bound, we are attempting to unify this fact with a fact in the database which asserts that `susan` is, in fact, a person. If we find that fact, we get to the top of the rule and know that `tom` gives a book to `susan`. So the terminology is right, but this means we tend to have exhaustive, depth-first backtracking searches. If we have lots of facts in our database, these searches can be very slow. Having done a lot of work in this area before, I know how incredibly common (and slow) the binding and unbinding of data can be, so while I generally tell people not to do premature optimization, like many others, I caution folks that this isn't always true. If you're writing a ray-tracing program and you need it to be fast, you know you're probably not going to reach for Perl first. (Of course, if you want predicate logic, you also don't want to reach for Perl first, but I'm strange that way). Cheers, Ovid New address of my CGI Course.	[reply] [d/l] [select]
Re: Efficient partial deep cloning by toma (Vicar) on Jan 11, 2008 at 05:13 UTC
Maybe you can use one of the B:: modules. I have only used them a couple of times, but it seems like they do the reflection that you are looking for. You end up in the internals because that's where the structure is. For example: `use B::XPath; sub some_function { my $struct = [ 'whaa!', undef, [ undef, undef, { foo => 3 } ], { bar => 'baz' }, ]; } my $node = B::XPath->fetch_root( \&some_function ); print_children($node, 0); sub print_children { my $node= shift; my $indent= shift; foreach my $child ($node->get_children()) { print " " for (0..$indent); print $child->get_name()."=".$child."\n"; $indent++; print_children($child, $indent); $indent--; } }` [download] It should work perfectly the first time! - toma	[reply] [d/l]


Don't ask to ask, just ask
	PerlMonks