http://www.perlmonks.org?node_id=576872

Good `localtime` monks,
I was trying to find a better way to extract data from XML and I think I might have an idea. Maybe it's stupid, maybe there already is a module with that interface, but maybe not and maybe others will like the idea as well and there will be a point in implementing it.

The basic theme goes like this ... XML is basically a serialized data structure, and what we want to get as we parse it is basically a data structure as well. What we get from the parsers is either too generic and too complex (DOM, XML::Parser's Tree style) or too restricted (XML::Simple). Either we end up with a structure that's hard to use, or we have to restrict the set of XMLs we can handle ... and still end up with a structure that may be more complex than necessary. What might help is a way to specify rules that transform the individual tags (with their attributes and content) into whatever data structure we need. The rules would be applied from the leaves all the way to the root. We would then either produce a simplified data structure containing just the stuff we are interested in, in a convenient format, or process the partial structures as the rules produce them.

An example is worth a thousand words, so here's one:

    $xml = <<'*END*';
    <doc>
      <person>
        <fname>...</fname>
        <lname>...</lname>
        <email>...</email>
        <address>
          <street>...</street>
          <city>...</city>
          <country>...</country>
          <bogus>...</bogus>
        </address>
      </person>
      <person>
        <fname>...</fname>
        <lname>...</lname>
        <email>...</email>
        <address>
          <street>...</street>
          <city>...</city>
          <country>...</country>
          <bogus>...</bogus>
        </address>
      </person>
    </doc>
    *END*

    %rules = (
        _default => sub { $_[0] => $_[1]->{_content} },
            # by default we are only interested in the content and we want
            # the parent to access it as an attribute of the same name as was the tag
        bogus => undef,
            # means "ignore"
        address => sub { address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})" },
            # let's convert the address to a single string
        person => sub { '@person' => "$_[1]->{lname}, $_[1]->{fname}\n<$_[1]->{email}>\n$_[1]->{address}" },
            # push the stringified data into the @{$parent->{person}}
        doc => sub { join( "\n\n", @{$_[1]->{person}}) },
    );

    print XML::TransformRules::Parse( $xml, \%rules);

Or, a bit more complex:

    $xml = <<'*END*';
    <doc>
      <person>
        <fname>...</fname>
        <lname>...</lname>
        <email>...</email>
        <address>
          <street>...</street>
          <city>...</city>
          <country>...</country>
          <bogus>...</bogus>
        </address>
        <phones>
          <phone type="home">123-456-7890</phone>
          <phone type="office">663-486-7890</phone>
          <phone type="fax">663-486-7000</phone>
        </phones>
      </person>
      <person>
        <fname>...</fname>
        <lname>...</lname>
        <email>...</email>
        <address>
          <street>...</street>
          <city>...</city>
          <country>...</country>
          <bogus>...</bogus>
        </address>
        <phones>
          <phone type="office">663-486-7891</phone>
        </phones>
      </person>
    </doc>
    *END*

    %rules = (
        _default => sub { $_[0] => $_[1]->{_content} },
        bogus => undef,
        address => sub { address => "$_[1]->{street}, $_[1]->{city} ($_[1]->{country})" },
        phone => sub { $_[1]->{type} => $_[1]->{_content} },
            # let's use the "type" attribute as the key and the content as the value
        phones => sub { delete $_[1]->{_content}; %{$_[1]} },
            # remove the text content and pass along the type => content from the child nodes
        person => sub {
            # let's print the values, all the data is readily available in the attributes
            print "$_[1]->{lname}, $_[1]->{fname} <$_[1]->{email}>\n";
            print "Home phone: $_[1]->{home}\n" if $_[1]->{home};
            print "Office phone: $_[1]->{office}\n" if $_[1]->{office};
            print "Fax: $_[1]->{fax}\n" if $_[1]->{fax};
            print "$_[1]->{address}\n\n";
            return; # the <person> tag is processed, no need to remember what it contained
        },
    );

    XML::TransformRules::Parse( $xml, \%rules);

Even though I talked about transforming the data structure, nothing prevents us from applying the rules as we parse the document, as soon as we encounter each closing tag. So we do not have to load the whole document into memory if we don't need it all at once. And even if we do, we have a chance to trim it down as we read it and end up with a much smaller data structure.
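
To make the "apply the rule at the closing tag" idea a bit more concrete, here is a minimal sketch in plain Perl. It is not an implementation of the proposed module: the `@events` list is a hand-written stand-in for what a stream parser would deliver, `run_events` is a made-up name, tag attributes are left out, and the merging of rule results is simplified to plain key/value pairs only.

```perl
use strict;
use warnings;

# Hand-written stand-in for the start/text/end events of a stream parser.
my @events = (
    [ start => 'doc' ],
    [ start => 'person' ],
    [ start => 'fname' ], [ text => 'Jane' ], [ end => 'fname' ],
    [ start => 'lname' ], [ text => 'Doe' ],  [ end => 'lname' ],
    [ end   => 'person' ],
    [ start => 'person' ],
    [ start => 'fname' ], [ text => 'John' ], [ end => 'fname' ],
    [ start => 'lname' ], [ text => 'Roe' ],  [ end => 'lname' ],
    [ end   => 'person' ],
    [ end   => 'doc' ],
);

my @seen;    # filled by the person rule as each <person> closes
my %rules = (
    _default => sub { $_[0] => $_[1]{_content} },
    person   => sub { push @seen, "$_[1]{lname}, $_[1]{fname}"; return },
);

sub run_events {
    my ($events, $rules) = @_;
    my @stack = ( {} );    # one hash per open tag; the bottom one stands for the root's parent
    for my $event (@$events) {
        my ($type, $data) = @$event;
        if ($type eq 'start') {
            push @stack, { _content => '' };
        }
        elsif ($type eq 'text') {
            $stack[-1]{_content} .= $data;
        }
        else {    # 'end': the tag is complete, so its rule can fire right now
            my $node = pop @stack;
            my $rule = exists $rules->{$data} ? $rules->{$data} : $rules->{_default};
            next unless $rule;    # an undef rule means "ignore"
            my @result = $rule->($data, $node);
            # simplified merge: only even-sized key/value lists are handled here
            my %pairs = @result % 2 ? () : @result;
            @{ $stack[-1] }{ keys %pairs } = values %pairs;
        }
    }
    return $stack[0];
}

run_events(\@events, \%rules);
print "$_\n" for @seen;    # prints "Doe, Jane" and then "Roe, John"
```

At any moment the stack holds only the currently open tags, and since the person rule returns nothing, a finished `<person>` leaves no trace in its parent.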

The rules receive two parameters: the name of the tag and a hash containing the attributes and the content. For leaf nodes, the attributes are the tag's attributes and _content is the textual content of the tag. For other tags it's a bit more complex: the hash also contains whatever was returned by the rules of the subtags. The rules may return:

  1. nothing (empty list, undef or empty string) - nothing gets added to the parent's data structure
  2. a single string - the string gets appended or pushed to the _content of the parent
  3. a single reference - the parent's _content is converted to an array (if necessary) and the reference is pushed there
  4. a list with an even number of items - add the keys (odd items) and values (even items) to the parent's data structure; if a key starts with '@', push the value at the end of the array referenced by the key (without the '@'). The value may be a reference.
  5. everything else is an error
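
The five cases above could be folded into the parent's hash roughly like this. It is only a sketch under a few assumptions: `merge_result` is a made-up helper name, not part of any existing module, and "appended or pushed" is read here as "append to a plain string, push if _content is already an array".

```perl
use strict;
use warnings;

# Hypothetical helper: fold whatever a rule returned into the parent's hash.
sub merge_result {
    my ($parent, @result) = @_;

    # Case 1: nothing, undef or empty string -> parent stays unchanged
    return if !@result
        or (@result == 1 and (!defined $result[0] or $result[0] eq ''));

    if (@result == 1) {
        my $value = $result[0];
        my $old   = $parent->{_content};
        if (ref $value) {
            # Case 3: a reference -> turn _content into an array if needed, then push
            $parent->{_content} = [ defined $old && $old ne '' ? $old : () ]
                unless ref($old) eq 'ARRAY';
            push @{ $parent->{_content} }, $value;
        }
        elsif (ref($old) eq 'ARRAY') {
            # Case 2, _content is already an array -> push the string
            push @{ $parent->{_content} }, $value;
        }
        else {
            # Case 2, plain _content -> append the string
            $parent->{_content} = '' unless defined $old;
            $parent->{_content} .= $value;
        }
        return;
    }

    # Case 5: an odd-sized list with more than one item is an error
    die "rule returned an odd-sized list\n" if @result % 2;

    # Case 4: key/value pairs; '@key' means push onto @{ $parent->{key} }
    while (my ($key, $value) = splice @result, 0, 2) {
        if ($key =~ /^\@(.+)/s) { push @{ $parent->{$1} }, $value }
        else                    { $parent->{$key} = $value }
    }
    return;
}

# Usage: what a <person> hash might collect from its subtags
my %person;
merge_result(\%person, fname => 'Jane');               # plain key/value pair
merge_result(\%person, '@phone' => '123-456-7890');    # pushed to $person{phone}
merge_result(\%person, '@phone' => '663-486-7890');
merge_result(\%person);                                # ignored subtag, no change
```

After these calls `$person{fname}` is 'Jane' and `$person{phone}` is an array reference holding the two numbers.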

Hope the explanation makes sense. So the question is, is there something like this already? Does it make sense? Would you be interested in such a module? What parser should I build this on top of? Should it be a separate module or should I rather try to add this to XML::Parser as yet another style?