PerlMonks  

Concatenating XML files

by moritz (Cardinal)
on Jul 30, 2007 at 14:20 UTC
moritz has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I have to concatenate xml files, and discard the outer delimiting tags.

So the files might look like this:

    <!-- file 1 -->
    <?xml version="1.0" encoding="UTF-8"?>
    <foo>
      <bar1>some data</bar1>
      <!-- more data here -->
    </foo>

    <!--file 2-->
    <?xml version="1.0" encoding="UTF-8"?>
    <foo>
      <bar2>some more data</bar2>
      <!-- more data here -->
    </foo>

    <!-- this is how the resulting file should look like: -->
    <?xml version="1.0" encoding="UTF-8"?>
    <foo>
      <bar1>some data</bar1>
      <!-- more data here -->
      <bar2>some more data</bar2>
      <!-- more data here -->
    </foo>

My first attempt was to use XML::Twig, but the vast number of methods overwhelmed me, and I couldn't find one that simply returns a text representation of all the sub elements (including markup).

Is there an easy way to do it with XML::Twig or another XML module?

I could certainly use regexes to parse the beginning of the file and then paste it verbatim until the second-to-last line in the file, but that seems a bit ugly, so I'd appreciate better suggestions ;-)
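For reference, the regex approach described above might look roughly like the following. This is only a sketch under stated assumptions, not a robust parser: the helper names `split_root` and `concat_docs` are made up for illustration, and the pattern assumes no DTD, nothing but whitespace before or after the root element (apart from the XML declaration), and no `>` inside the root tag's attribute values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split a document string into
# the root tag name and everything between the root's open and close tags.
sub split_root {
    my ($xml) = @_;
    $xml =~ m{
        \A \s*
        (?: <\?xml .*? \?> \s* )?   # optional XML declaration
        < ([\w:.-]+) [^>]* >        # opening root tag, name captured
        (.*)                        # inner content, up to the last close tag
        </ \1 \s* > \s* \z          # matching close tag at the very end
    }xs or die "unexpected document layout\n";
    return ( $1, $2 );
}

# Hypothetical helper: join the inner content of several documents under
# the root name of the first one; dies if the root names differ.
sub concat_docs {
    my @docs = @_;
    my $root_name;
    my $body = '';
    for my $xml (@docs) {
        my ( $name, $inner ) = split_root($xml);
        $root_name //= $name;
        die "root element mismatch: $name vs $root_name\n"
            unless $name eq $root_name;
        $body .= $inner;
    }
    return qq{<?xml version="1.0" encoding="UTF-8"?>\n}
         . "<$root_name>$body</$root_name>\n";
}

# Sample data modelled on the files in the question.
my @docs = (
    qq{<?xml version="1.0" encoding="UTF-8"?>\n<foo>\n<bar1>some data</bar1>\n</foo>\n},
    qq{<?xml version="1.0" encoding="UTF-8"?>\n<foo>\n<bar2>some more data</bar2>\n</foo>\n},
);

print concat_docs(@docs);
```

The sketch works on whole strings rather than line by line, so it does not depend on the closing tag sitting on a line of its own.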

Update: fixed typo in title (how embarrassing)

Re: Conatenating XML files
by un-chomp (Scribe) on Jul 30, 2007 at 15:08 UTC
    XML::LibXML to the rescue (again):
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Find::Rule;
    use XML::LibXML;

    # get a list of target files
    my $input_folder = 'input';    # whatever
    my @files = File::Find::Rule->file->in( $input_folder );

    # initiate XML parser
    my $parser = XML::LibXML->new;
    $parser->expand_entities( 0 );    # leave entities alone

    # go through target files, collecting elements of interest
    my @wanted;
    foreach my $file ( @files ) {
        # parse XML
        my $dom = $parser->parse_file( $file );    # input
        # select and store all top level elements
        # (elements only; comments between them are not selected)
        push @wanted, $dom->documentElement->findnodes( './*' );
    }

    # make a new document
    my $new = XML::LibXML::Document->new( '1.0', 'UTF-8' );

    # add root element ('foo' matches the example in the question)
    my $root = XML::LibXML::Element->new( 'foo' );
    $new->setDocumentElement( $root );

    # add the inner elements we've collected; appendChild performs the
    # DOM checks that addChild skips, which matters for nodes that come
    # from other documents
    $root->appendChild( $_ ) for @wanted;

    # output
    print $new->toString;
Re: Conatenating XML files
by john_oshea (Priest) on Jul 30, 2007 at 15:17 UTC

    XML::LibXML::Document has a toString() method that converts the parsed DOM into a string, including all the child nodes. It takes an optional format parameter; when that is set to zero, "...the document is dumped as it was originally parsed".

    Depending on how lazy / strict you're feeling you could either:

    • just serialize each document into a string and use (anchored) regexes to strip off the document root
    • use the documentElement() method to get the root of each document, then loop through its childNodes array, and individually toString() each of those

    I'd be surprised if there isn't an XML::Twig equivalent to that - I'm just personally much more familiar with XML::LibXML - hope that helps.
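The second option (walking the root's child nodes and serializing each one) could be sketched as below. This is an illustration under assumptions rather than tested production code: the helper name `inner_xml_of_roots` is made up, the documents are parsed from in-memory strings so the example is self-contained, and the output root name `foo` is hard-coded from the question's example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::LibXML;

# Sample inputs modelled on the question; real code would more likely use
# XML::LibXML->load_xml( location => $path ) on each file.
my @docs = (
    qq{<?xml version="1.0" encoding="UTF-8"?>\n}
      . q{<foo><bar1>some data</bar1><!-- more data here --></foo>},
    qq{<?xml version="1.0" encoding="UTF-8"?>\n}
      . q{<foo><bar2>some more data</bar2><!-- more data here --></foo>},
);

# Hypothetical helper: serialize every child node of each document's root
# element and return the pieces joined into one string.
sub inner_xml_of_roots {
    my @sources = @_;
    my $inner = '';
    for my $source (@sources) {
        my $dom  = XML::LibXML->load_xml( string => $source );
        my $root = $dom->documentElement;
        # childNodes returns text and comment nodes as well as elements
        $inner .= $_->toString for $root->childNodes;
    }
    return $inner;
}

my $joined = inner_xml_of_roots(@docs);

# Wrap the collected content in a new root (name assumed to be 'foo').
print qq{<?xml version="1.0" encoding="UTF-8"?>\n<foo>$joined</foo>\n};
```

Because `childNodes` is used instead of an XPath such as `./*`, comments inside the root element are carried over too, which is what the expected output in the question shows.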

      Thank you john_oshea. Probably there is such a method in XML::Twig, but knowing another module that does the job is fine for me as well.
Re: Conatenating XML files
by mirod (Canon) on Jul 30, 2007 at 16:07 UTC

    The method you are looking for is xml_string, which is also aliased as innerXML.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    print "<foo>";
    foreach my $file ( 'to_concat_1.xml', 'to_concat_2.xml') {
        print XML::Twig->new( keep_spaces => 1, comments => 'process')
                       ->parsefile( $file)->root->xml_string;
    }
    print "</foo>\n";

    Note that the comments => 'process' is used only because you have comments just before the end of the foo element in your example, it is probably not needed in your real code.

    Also a more generic way would be to keep the first document, and then to add the other ones at the end, removing their root:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $result_twig;
    foreach my $file ( 'to_concat_1.xml', 'to_concat_2.xml') {
        my $current_twig= XML::Twig->new( comments => 'process')->parsefile( $file);
        if( !$result_twig) {
            # keep the first document as the result
            $result_twig= $current_twig;
        }
        else {
            # move the root under the result's root, then erase it so
            # that only its children remain
            $current_twig->root->move( last_child => $result_twig->root)
                               ->erase;
        }
    }
    $result_twig->print;
      Also a more generic way would be to keep the first document, and then to add the other ones at the end, removing their root: ...

      I'm not sure how much more "generic" your second solution would be... It might not apply very well, I think, in cases where the quantity of XML data to be concatenated, multiplied by the additional memory consumed for DOM data structure storage in Perl, could exceed available RAM.

        It's more generic in the sense that it doesn't assume that the root tag of the first document is foo. As you did not mention any constraint on the size of the documents, I did not assume any.

        If there are constraints, for each individual file or for the resulting file, then you should mention them and the solution would be different. Depending on the constraints in terms of speed and potential size of the documents, the best solution could be regexp based (that could be made quite robust, provided your XML files do not include DTDs), XML::LibXML based (if individual files are not too big to be loaded in memory), XML::Parser based (rather easy if no DTD is used, a bit more complicated otherwise) or XML::Twig based (slower, but you could deal with arbitrary sized documents, and that would be quite easy to code, although a bit more complex than the examples I gave previously).

        But all those potential constraints would be part of the requirements for your code, so you would have to express them if you want a response that really fits your problem.

        And no, I am not trying to confuse you to cover up the fact that my answer wasn't that smart ;--)
