onegative has asked for the wisdom of the Perl Monks concerning the following question:

G'day most honorable Monks,

I have a problem that is obviously above my head and I would really appreciate a push start getting it solved.

I generate the Original_Group.xml using SQL and Perl but because this is outside the applications capability to create its own export/import XML with a parent/child Group relationships so I am having to make this up as I go.

The Sorted_Group.xml is the manual manipulation that is required of the Perl parser I am trying to build.

The resulting Sorted_Group.xml requires that a Group record be Inserted only if it does not contain any Sub-Group members or that a defined Sub-Group member(s) already exists (inserted previously) before it will be allowed to be inserted.

Just not sure how to handle sorting with multiple levels of members in the xml record.

Any and all help is greatly appreciated.

Thanks,
Danny

Original_Group.xml
<SystemXchange version="1.0"> <summary> <tool version="2.8 build: [070207]">Enterprise Migration Tool< +/tool> <source></source> <timestamp>May 30, 2008 @ 10:00:00 CDT</timestamp> </summary> <!--Begin Group Records--> <records type="group" operation="insert"> <record> <name>GOLF</name> <members> <Group>ECHO</Group> <Group>BETA</Group> <Group>DELTA</Group> <Group>FOXTROT</Group> </members> </record> <record> <name>ECHO</name> <members> <Group>CHARLIE</Group> <Group>DELTA</Group> </members> </record> <record> <name>CHARLIE</name> <members> <Group>ALPHA</Group> <Group>DELTA</Group> </members> </record> <record> <name>BETA</name> <members> <Group>FOXTROT</Group> <Group>CHARLIE</Group> </members> </record> <record> <name>FOXTROT</name> <members> <Group>ALPHA</Group> </members> </record> <record> <name>DELTA</name> <members> <Group>ALPHA</Group> <Group>FOXTROT</Group> </members> </record> <record> <name>ALPHA</name> <members> </members> </record> </records> </SystemXchange>


Sorted_Group.xml
<SystemXchange version="1.0"> <summary> <tool version="2.8 build: [070207]">Enterprise Migration Tool< +/tool> <source></source> <comments> </comments> <timestamp>May 30, 2008 @ 10:00:00 CDT</timestamp> </summary> <!--Begin Group Records--> <records type="group" operation="insert"> <record> <name>ALPHA</name> <members> </members> </record> <record> <name>FOXTROT</name> <members> <Group>ALPHA</Group> </members> </record> <record> <name>DELTA</name> <members> <Group>ALPHA</Group> <Group>FOXTROT</Group> </members> </record> <record> <name>CHARLIE</name> <members> <Group>ALPHA</Group> <Group>DELTA</Group> </members> </record> <record> <name>BETA</name> <members> <Group>FOXTROT</Group> <Group>CHARLIE</Group> </members> </record> <record> <name>ECHO</name> <members> <Group>CHARLIE</Group> <Group>DELTA</Group> </members> </record> <record> <name>GOLF</name> <members> <Group>ECHO</Group> <Group>BETA</Group> <Group>DELTA</Group> <Group>FOXTROT</Group> </members> </record> </records> </SystemXchange>

Replies are listed 'Best First'.
Re: Parent/Child Group Relationship in XML Imports
by Jenda (Abbot) on Jun 01, 2008 at 15:03 UTC

    Not exactly the order you specified in the example, but a correct order anyway:

    use strict; use warnings; no warnings 'uninitialized'; use XML::Rules; my %written; my %depends; my $parser = XML::Rules->new( style => 'filter', rules => { _default => 'raw', 'name,members' => 'raw extended', 'Group' => 'raw extended array', record => sub { my @I_depend; if (exists $_[1]->{':members'} and exists $_[1]->{':member +s'}{':Group'}) { @I_depend = grep !exists $written{$_}, map $_->{_conte +nt}, @{$_[1]->{':members'}{':Group'}}; } if (@I_depend) { $_[1]->{':I_depend'} = {map {$_ => 1} @I_depend}; foreach (@I_depend) { push @{$depends{$_}}, $_[1] } return; } else { my $name = $_[1]->{':name'}{_content}; $written{$name} = 1; my @to_write = $_[1]; if (exists $depends{$name}) { push @to_write, find_dependent($name); } return 'record' => \@to_write; } } } ); sub find_dependent { my $name = shift; my @to_write; foreach my $parent (@{$depends{$name}}) { delete $parent->{':I_depend'}{$name}; if (! %{$parent->{':I_depend'}}) { # if it doesn't depend on a +nything more push @to_write, $parent; push @to_write, find_dependent($parent->{':name'}{_content +}); } } delete $depends{$name}; return @to_write; } $parser->filter(\*DATA); if (%depends) { print STDERR "Unsatisfied dependencies! Some records depend on the +se unknown groups: ", join ", ", keys %depends; }

    Only if all the records depend (directly or indirectly) on the last one, will the script keep at one moment all the data in memory. Otherwise whenever it finds a record with no dependencies or whose all dependencies were already printed, it will print that record and all records that only depended on this one (and those already printed).

      Hi Jenda,

      Well I have been trying to get a handle on this script but I have yet to understand exactly what is is doing other than looking at the first record and determining if it has dependencies in other records still existing in the @I_depend array, if not it prints the record. It works well but I have still having trouble in that it prints out "Unsatisfied dependencies" even though the records in the original xml are present. I am trying to understand your last comments where you state and I quote, "Only if all the records depend (directly or indirectly) on the last one, will the script keep at one moment all the data in memory." because I can't seem to understand exactly what this means in terms of the original xml and the order in which the records may be arranged.

      If you have the time can you help me understand the code by possibly adding some comments around the heart of the code? I have been reading the XML::Rules documentation but it is still above my head.

      And out of the 1992 records 80 return in error as "Unsatisfied dependencies" but when I look at the original xml I can't see any obvious differences in the records that do work or a reason why they fail to work with the script.

      One possibility that may exist is that many if not all of these 80 failed records have very like names of other records. Is it possible that the grep to check for the NOT EXIST is somehow not being a specific as it needs to be? I am really trying to understand this script as it has so many things I have never used in any of my scripts, I have only been doing perl for a couple years and I can do basic stuff but your script goes beyond my base comprehension that I would really like to GET IT and understand.

      Example Record Names

      <name>cicoutage</name> <name>cic</name> <name>actpage4_ecc</name> <name>actpage4</name> <name>actpage</name>
      Thanks for all your help,

      Danny

        The error message lists the missing groups, do they exist in the original XML file? Look for <name>such_id</name>. If they don't, your have an inconsistency in your file, if they do, I have a bug in the code. Both are possible.

        If you change the last four lines of the script to

        if (%depends) { print STDERR "\nUnsatisfied dependencies!\n"; foreach my $missing (keys %depends) { print "Missing group '$missing' is a member of:\n"; foreach my $failed (@{$depends{$missing}}) { print " $failed->{':name'}{_content}\n"; } } }
        you will get a little more info regarding the problem.

        I'll try to dissect the code a bit for you. I understand XML::Rules is a bit hard to get your head around. It fits my twisted brain perfectly, but I find it hard to explain. The basic idea behind the module was .. what if we converted XML into executable code? Something like

        <root> <sub>value</sub> <some><x>45</x><y>77</y></some> </root>
        ==>
        root( sub('value'), some( x(45), y(77)) )
        That is we would convert each tag into a function call and make the result of the call available to the function into which we converted the parent tag.

        Of course it's not as simple as that with attributes and content and with the functions sometimes needing to insert something into what's passed to the outer function as the content and sometimes add some attributes, but this is the basic concept. For each tag specify what to do with its data and add the result to the data of the parent tag.

        There are some actions that are so simple and often needed that I added some builtin rules so that you don't have to write for example

        'name' => sub { return 'name' => $_[1]->{_content}}, or 'name' => sub { my ($tag,$attr) = @_; return 'name' => $attr->{_content} },
        if you are interested only in the content of the <name> tag and wanted it to be available as "name" attribute in the data of its parent tag, but instead you can write just
        'name' => 'content',

        Now this all works great if you want to extract data from the XML, restricting and tweaking the data as it's being parsed, but if you want to filter the XML and end up with something very similar you run into problems. Eg. in the example above the datastructure for the <name>'s parent tag will look exactly the same for these two snippets of XML:

        <parent x="1"><name>jenda</name></parent> and <parent x="1" name="jenda"/>
        and if you output that datastructure you actually end up with the later (because it's shorter). If you want to make sure the "name" is output as a tag you have to build a datastructure that'll make clear that it can't be an attribute. But that would make it hard(er) to use within the function that's evaluated for the parent tag. So apart from the 'raw' builtin that leaves the data of the tag in the "content" of the parent tag such that when printed it will look exactly the same as the original (apart from the order of attributes which is irrelevant according to XML specs) there are also 'raw extended' and 'raw extended array'. Both leave the data in the "content" so that they can be printed, but also add an "attribute" named after the tag with a colon at the beginning containg the tags data. To allow you to access them.

        Sounds awfull, but it means that if you set the rule of <name> to 'raw extended' and the XML looks like this

        <parent x="1">blah <name>jenda</name> blah</parent>
        then the parent() will receive a datastructure like this
        { x => "1", ":name" => {_content => "jenda"}, _content => [ "blah ", ["name" => {_content => "jenda"}], " blah"] }
        So you can access the value of the <name> by $attr->{':name'){_content}, but if you need you can convert the data to XML that'll look exactly like the original.

        So, the first three rules make sure the data passed to the <record> handler contain all the info so that we can print that tag with content exactly how it was and at the same time that the data I need are easy to access. The name of the group is in $attr->{':name'}{_content} and the member groups are in @{$attr->{':members'}{':Group'}}.

        The first thing the handler does is that it finds out whether the currently processed group depends on any group we haven't written into the output file yet. It checks whether it depends on anything and if it does, it takes the array of hashrefs referenced by $attr->{':members'}{':Group'}, from each hashref takes just the value of the {_content} key and removes those that were already written.

        If there are any such yet-unsatisfied dependencies, it records them in the tag's datastructure under the key {':I_depend'} (converting the list of names to a hash whose keys are the items of that list) and adds a reference to this tag's data to the %depends hash under each name of group it depends on. And returns nothing. Because we do not want to print anything at the moment.

        If we find out that there are no unsatisfied dependencies we note the record name in the %written hash, find the other records that depended on the current one (or on some record that depended on the current record or ...) and then we return the current record and all the records we found so that XML::Rules can write them to the output.

        Finding the dependent record is a recursive function, we take the name of the record we are just about to write and process all records that depend on it (from $depends{$that_name}), for each remove the name from its {':I_depend'} and if it was the last record that one depended on, we add it to the list to write and process its dependencies.

        I hope I did not make things more confused by this terribly long post :-) It might help to add a few print Dumper(something) throughout the code to understand what data are where at what point.

        Update: FOUND IT! There indeed was a problem in my code. One missing line. Please add $written{$parent->{':name'}{_content}} = 1; after the push @to_write, $parent; line. Silly mistake. I did not mark "written" those groups that were written after the group they depended on, only the groups that were written immediately, without waiting for another group to be parsed, were marked. Anyway there still is one unsatisfied dependency in your XML. The WTSWestTier1 depends on ThinClientOncall_Group and that one is nowhere to be found.