Multiple XML files from Directory to One XML file using perl.

by jyo (Initiate)
on Nov 18, 2011 at 14:06 UTC ( [id://938833]=perlquestion )

jyo has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks, this is my first experience with the Perl language. I have multiple XML files in a folder, and I need to combine them into one XML file. The example files look like this:

<!-- soc_foo.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <shipto>
    <name>johan</name>
    <address>Langgt 23</address>
    <!-- more info -->
  </shipto>
</shiporder>

<!-- tmm_foo.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <shipto>
    <name>benny</name>
    <address>galve 23</address>
    <!-- more info -->
  </shipto>
</shiporder>

<!-- svr_foo.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <shipto>
    <name>kent</name>
    <address>vadrss 25</address>
    <!-- more info -->
  </shipto>
</shiporder>

Each XML file has the same root and nodes with different data (the name and address differ). I need to combine the multiple XML files into one XML file, so that my output looks like this:

<!-- new.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <shipto>
    <name>johan</name>
    <address>Langgt 23</address>
    <!-- more info -->
  </shipto>
  <shipto>
    <name>benny</name>
    <address>galve 23</address>
    <!-- more info -->
  </shipto>
  <shipto>
    <name>kent</name>
    <address>vadrss 25</address>
    <!-- more info -->
  </shipto>
</shiporder>

I find all the XML files in the directory using File::Find, but how can I add the data from all the XML files into one XML file without repeating the same data? It may look easy to you, but I am very new to Perl, so please help me with this problem. This is my first post; please excuse any mistakes.

For the above question, I tried the following, using XML::LibXML::Reader:

#!/usr/bin/perl
use warnings;
use strict;
use Carp;
use File::Find;
use File::Spec::Functions qw( canonpath );
use XML::LibXML::Reader;
use Data::Dumper;

my $INFO;
my @ARGV ="C:/file/dir";
die "Need directories\n" unless @ARGV;

find(
    sub {
        my $file = $_;
        #my $path = canonpath $File::Find::name;
        my $path =$_;
        return unless -f $path;
        return unless $file =~ /[.]xml\z/i;
        extract_information($path);
        return;
    },
    @ARGV
);

sub extract_information {
    my( $path)=@_;
    my $ret = open my $xmlin, '<', $path;
    unless ($ret) {
        carp "Cannot open '$path': $!";
        return;
    }
    my $reader = XML::LibXML::Reader->new(IO => $xmlin);
    unless ($reader) {
        carp "Cannot create reader using '$path'";
        return;
    }
    while ($reader->nextElement('shipto')) {
        $INFO = $reader->readOuterXml();
        print "$INFO\n";
    }
    close $xmlin or carp "Cannot close '$path': $!";
    return;
}

but I have two problems with this script:

1) I am extracting information from all the XML files that have the "shiporder" node element, but one XML file has data under a different node element, "definition", and I am not extracting that information. What should I do if I want to extract that information too and store it in the same variable?

2) After extracting, all the information is stored in the $INFO variable. How can I write the contents of $INFO to a single XML file? Please help me.

Thanks in advance

jyo

Replies are listed 'Best First'.
Re: Multiple XML files from Directory to One XML file using perl.
by choroba (Cardinal) on Nov 18, 2011 at 14:52 UTC
    I usually use XML::XSH2 for XML manipulation. In this case, I'd use something like this:
    $new := create new;
    for { glob "folder/*.xml" } {
        open (.) ;
        if $new/new {
            # for the first file, copy the root as well
            cp shiporder replace $new/new ;
        } else {
            cp shiporder/shipto append $new/shiporder ;
        }
    }
Re: Multiple XML files from Directory to One XML file using perl.
by graff (Chancellor) on Nov 19, 2011 at 03:22 UTC
    I suppose that if you were to make up a tag name to use as the one single container for all your existing xml files, it would be a pretty simple matter, and probably wouldn't even involve xml parsing at all. You just need to make sure that the new tag name that you make up does not already occur as a tag in any of the existing xml files.

    It's good that you already solved the part about finding all the files -- I'll use the OP code as a starting point (thanks for that), and reduce it down to just the essentials:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Carp;
    use File::Find;

    if ( @ARGV == 0 ) {
        push @ARGV, "C:/file/dir";
        warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
    }

    # open an output file whose name won't be found by File::Find
    open( my $allxml, '>', "all_xml_contents.combined" )
        or die "can't open output xml file for writing: $!\n";
    print $allxml '<?xml version="1.0" encoding="UTF-8"?>', "\n<all_xml_contents>\n";

    find(
        sub {
            return unless ( /[.]xml\z/i and -f );
            extract_information();
            return;
        },
        @ARGV
    );
    print $allxml "</all_xml_contents>\n";

    sub extract_information {
        my $path = $_;
        if ( open my $xmlin, '<', $path ) {
            # drop the first line only if it is the <?xml ...?> declaration
            local $_ = <$xmlin>;
            print $allxml $_ unless ( /<\?xml/ );
            # copy the rest of the file unchanged
            while ( <$xmlin> ) {
                print $allxml $_;
            }
        }
        return;
    }
    The point is that, since each input xml file is a fully self-contained element, and you probably don't want to disrupt that structure, all you need is to create a novel tag that won't get confused with any existing content, and use that as the one element that will contain everything else being put into the new file. Just drop the initial <?xml...?> line from each input file. (I've seen a lot of "xml" files that don't start with that, so I think it's worthwhile to check.)

    Other things I changed in the code were:

    • fixed how @ARGV is handled -- don't put "my" in front of that, and use it the way it was meant to be used.
    • removed modules you didn't really need (Data::Dumper, XML::LibXML, File::Spec::Functions)
    • removed variables and statements you didn't need (using $_ more)
    • (update:) only use -f on items whose path/file names end in ".xml" (could probably skip -f altogether)
    Now, this doesn't handle the problem of removing duplicate xml content, but that's something that will be a lot easier to do after you've written the one big xml file. That's where a good parsing module (like XML::LibXML) will come in very handy.

    If your duplication problem is really just a matter of the (exact) same xml content showing up in multiple files (e.g. "foo1.xml" is a copy of "foo2.xml", or "blah1/foo.xml" is a copy of "blah2/foo.xml"), you can simply get md5 signatures of all the files first, sort by md5 values, and look for duplicates that way (files with identical content will have identical md5 values).

    But if the duplication problem involves elements that make up parts of files, then a parser is the only way to go, and you'll need to know enough about the data to figure out which elements need to be checked for duplicate content. If you know which tags to look at, running a parser on the "all-combined" xml will make it easy to find and remove the duplicates.
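As a rough illustration of the file-level check described above (whole files that are byte-for-byte copies of each other), here is a minimal sketch using only the core Digest::MD5 module. The helper names `file_md5` and `find_duplicate_files` are made up for this example and do not come from the thread; the sketch assumes the list of candidate paths has already been collected, for instance with File::Find.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex MD5 digest of a file's raw bytes.
# (file_md5 is a hypothetical helper name.)
sub file_md5 {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open '$path': $!\n";
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $digest;
}

# Group paths by digest and return only the groups that have more
# than one member, i.e. sets of files with identical content.
# (find_duplicate_files is a hypothetical helper name.)
sub find_duplicate_files {
    my (@paths) = @_;
    my %by_md5;
    push @{ $by_md5{ file_md5($_) } }, $_ for @paths;
    return grep { @$_ > 1 } map { $by_md5{$_} } sort keys %by_md5;
}

# Report duplicates among the files named on the command line, if any.
for my $group ( find_duplicate_files(@ARGV) ) {
    print "Identical content: @$group\n";
}
```

Each returned group is an array reference of paths, so one way to use it would be to keep the first path of every group and skip the rest when building the combined file. This only catches exact copies; files that differ by a single space count as different.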

      Hi graff, thanks for your reply. I tried your code. It runs, but it prints all the XML files into one file, like this:

      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <shipto>
          <name>johan</name>
          <address>Langgt 23</address>
          <!-- more info -->
        </shipto>
      </shiporder>
      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <shipto>
          <name>benny</name>
          <address>galve 23</address>
          <!-- more info -->
        </shipto>
      </shiporder>
      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <shipto>
          <name>kent</name>
          <address>vadrss 25</address>
          <!-- more info -->
        </shipto>
      </shiporder>

      How can I eliminate these tags?

      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

      They are printed in the final XML file once for every input file. Can you tell me how I could eliminate them and print the output like this?

      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <shipto>
          <name>johan</name>
          <address>Langgt 23</address>
          <!-- more info -->
        </shipto>
        <shipto>
          <name>benny</name>
          <address>galve 23</address>
          <!-- more info -->
        </shipto>
        <shipto>
          <name>kent</name>
          <address>vadrss 25</address>
          <!-- more info -->
        </shipto>
      </shiporder>

      Every XML file starts with the same tags, so we need to eliminate the repeats; the final XML file should contain that opening tag only once. Please can you help me?

        First, I don't understand why the <?xml ...?> lines from all the input files are being included in the single output file -- when I use my code as posted, it removes those from each input. Either you're running something different from what I posted, or else there's something odd about the <?xml... lines in your data files.
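One possible explanation, which is only a guess since the thread never settles it, is that the declaration is not at the very start of the first line in those files, for example because a byte-order mark or a blank line comes before it. Under that assumption, the sketch below reads a whole file into memory and strips a leading declaration while tolerating a UTF-8 BOM and leading whitespace. The names `strip_xml_declaration` and `slurp_without_declaration` are hypothetical helpers, not part of any module.

```perl
use strict;
use warnings;

# Remove a leading XML declaration from a string holding a whole file.
# Tolerates a UTF-8 byte-order mark (as raw bytes or as the decoded
# character U+FEFF) and any whitespace in front of the declaration.
# A string without a declaration is returned unchanged.
sub strip_xml_declaration {
    my ($xml) = @_;
    $xml =~ s/\A(?:\xEF\xBB\xBF|\x{FEFF})?\s*<\?xml\b[^>]*\?>\s*//;
    return $xml;
}

# Read a file in one go and return its content minus the declaration.
sub slurp_without_declaration {
    my ($path) = @_;
    open my $in, '<:raw', $path or die "Cannot open '$path': $!\n";
    my $xml = do { local $/; <$in> };    # undef $/ reads the whole file
    close $in;
    return strip_xml_declaration($xml);
}
```

Inside a File::Find callback, something like `print $allxml slurp_without_declaration($_);` could stand in for the line-by-line copy. Slurping whole files is a design choice that suits small inputs such as the samples in this thread; for very large files the line-by-line approach uses less memory.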

        As for what you really want, which is one <shiporder ...> element containing all the content of all the files (that is, combining the "shipto" elements from all the input files into one "shiporder"), that's a different plan from what I was suggesting, and it would be best to use a parser for that.

        In fact, it seems like the OP code is really pretty close to what you want. Here's my version, with Digest::MD5 thrown in to eliminate duplicate "shipto" content:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use Carp;
        use File::Find;
        use XML::LibXML::Reader;
        use Digest::MD5 'md5';
        use Encode 'encode_utf8';

        if ( @ARGV == 0 ) {
            push @ARGV, "C:/file/dir";
            warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
        }

        # open an output file whose name won't be found by File::Find
        open( my $allxml, '>', "all_shiporders.xml.combined" )
            or die "can't open output xml file for writing: $!\n";
        print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
            "\n<shiporder xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";

        my %shipto_md5;    # digests of the "shipto" elements already written

        find(
            sub {
                return unless ( /[.]xml\z/i and -f );
                extract_information();
                return;
            },
            @ARGV
        );
        print $allxml "</shiporder>\n";

        sub extract_information {
            my $path = $_;
            if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
                while ( $reader->nextElement( 'shipto' )) {
                    my $elem = $reader->readOuterXml();
                    # md5() needs bytes, so encode first in case of non-ASCII text
                    my $md5 = md5( encode_utf8( $elem ));
                    # write each distinct "shipto" element only once
                    print $allxml $elem unless ( $shipto_md5{$md5}++ );
                }
            }
            return;
        }
        That seems to work on my test set of files, leaving out "j4.xml" because it's identical to "j2.xml". Here's the output; the only difference between this and what you wanted is the absence vs. presence of extra line-feeds around the "shipto" tags, which is just a matter of cosmetics:
        <?xml version="1.0" encoding="UTF-8"?>
        <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <shipto>
            <name>johan</name>
            <address>Langgt 23</address>
          </shipto><shipto>
            <name>benny</name>
            <address>galve 23</address>
          </shipto><shipto>
            <name>kent</name>
            <address>vadrss 25</address>
          </shipto><shipto>
            <name>stewart</name>
            <address>vadrss 25</address>
          </shipto></shiporder>
Re: Multiple XML files from Directory to One XML file using perl.
by Anonymous Monk on Nov 18, 2011 at 14:09 UTC

    But How can I Add all XML files data into one xml file without repeating same data.

    Keep track of duplicates, see uniq in perlfaq4
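The idiom that perlfaq4 describes comes down to a hash that remembers what has already been seen. Here is a minimal sketch of it applied to strings such as extracted "shipto" fragments; the sample data is invented for illustration. (Newer versions of List::Util, 1.45 and later, also export a ready-made `uniq`.)

```perl
use strict;
use warnings;

# Return the items of a list with later repeats removed,
# keeping the order in which each item was first seen.
sub uniq {
    my %seen;
    return grep { !$seen{$_}++ } @_;
}

# Made-up sample fragments; the third repeats the first.
my @fragments = (
    '<shipto><name>johan</name></shipto>',
    '<shipto><name>benny</name></shipto>',
    '<shipto><name>johan</name></shipto>',
);

# Prints the johan and benny fragments once each.
print "$_\n" for uniq(@fragments);
```

The comparison is on exact string equality, so two fragments that differ only in whitespace or attribute order would both be kept.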

Re: Multiple XML files from Directory to One XML file using perl.
by sundialsvc4 (Abbot) on Nov 18, 2011 at 15:06 UTC

    You are indeed picking an ambitious project for your first experience with the Perl language, but let me give you a website that you must bookmark:   http://search.cpan.org.

    If you go to that site and type, “XML,” then ... oops! ... you get 5,000 hits.   So, here are a couple of searches to have a look at:

    • XML::LibXML
    • XML::Twig
    • XSLT and XPath   (these are more-general search terms)

    Your general approach will be to construct a new XML object representing your output file, and then to cycle through a directory (search e.g. for File::Find), open each one of the XML files you find there, and transfer nodes and subtrees from one to the other.

    You may (erroneously, as it turns out) assume that you must write complicated code to navigate through the structure to see if things already exist, but you don’t:   that is what “XPath expressions” are for.   True XML difference-engines are also available.
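To make that approach concrete, here is a rough sketch using the XML::LibXML DOM interface (a CPAN module, so it has to be installed). It assumes every input has a `shiporder` root with `shipto` children, as in the original question, selects them with an XPath expression, and treats two `shipto` elements as duplicates only when their serialized text is identical. The function name `merge_shipto` is made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Build one <shiporder> document holding every distinct <shipto>
# element found in the given parsed documents.
# (merge_shipto is a hypothetical name; duplicates are detected by
# comparing each element's serialized text, an assumption of this sketch.)
sub merge_shipto {
    my (@docs) = @_;

    my $out  = XML::LibXML::Document->new( '1.0', 'UTF-8' );
    my $root = $out->createElement('shiporder');
    # Declare the xsi prefix on the new root, as in the input files.
    $root->setNamespace( 'http://www.w3.org/2001/XMLSchema-instance', 'xsi', 0 );
    $out->setDocumentElement($root);

    my %seen;
    for my $doc (@docs) {
        # The XPath expression picks out the nodes to transfer.
        for my $node ( $doc->findnodes('/shiporder/shipto') ) {
            next if $seen{ $node->toString }++;
            # importNode copies the subtree into the output document.
            $root->appendChild( $out->importNode($node) );
        }
    }
    return $out;
}

# Merge the files named on the command line, if any, and print the result.
if (@ARGV) {
    my @docs = map { XML::LibXML->load_xml( location => $_ ) } @ARGV;
    print merge_shipto(@docs)->toString(1);
}
```

Because the whole of each input is parsed into memory, this suits small to moderate files; for very large inputs a streaming reader such as XML::LibXML::Reader may be the better fit, which ties in with the point about file size below.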

    Your strategy will partly depend on how large these files actually are.

    You might also discover that there are altogether different ways to do it, such as XSLT, and existing engines such as Saxon.   In the desktop publishing world, DocBook is used to write documentation and books in an XML-based format, and books are actually assembled from hundreds or thousands of component parts using (a little hand-waving here ... what I am saying is not quite true...) no programming at all.   Because XML is now used so widely and in so many ways, the company of available tools is quite large and mature ... and, “Perl is there.”

      Hi, can anyone provide some example scripts that combine multiple XML files into one XML file?
