PerlMonks  

Re: Multiple XML files from Directory to One XML file using perl.

by graff (Chancellor)
on Nov 19, 2011 at 03:22 UTC (#938934)


in reply to Multiple XML files from Directory to One XML file using perl.

I suppose that if you were to make up a tag name to use as the one single container for all your existing xml files, it would be a pretty simple matter, and probably wouldn't even involve xml parsing at all. You just need to make sure that the new tag name that you make up does not already occur as a tag in any of the existing xml files.

It's good that you already solved the part about finding all the files -- I'll use the OP code as a starting point (thanks for that), and reduce it down to just the essentials:

#!/usr/bin/perl
use strict;
use warnings;
use Carp;
use File::Find;

if ( @ARGV == 0 ) {
    push @ARGV, "C:/file/dir";
    warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
}

# open an output file whose name won't be found by File::Find
open( my $allxml, '>', "all_xml_contents.combined" )
    or die "can't open output xml file for writing: $!\n";

# start the one made-up container element that will hold everything
print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
    "\n<all_xml_contents>\n";

find(
    sub {
        return unless ( /[.]xml\z/i and -f );
        extract_information();
        return;
    },
    @ARGV
);

print $allxml "</all_xml_contents>\n";

sub extract_information {
    my $path = $_;
    if ( open my $xmlin, '<', $path ) {
        # copy the first line only if it is not the <?xml ...?> declaration
        local $_ = <$xmlin>;
        print $allxml $_ unless ( /<\?xml/ );
        # copy the rest of the file unchanged
        while ( <$xmlin> ) {
            print $allxml $_;
        }
    }
    return;
}

The point is that, since each input xml file is a fully self-contained element, and you probably don't want to disrupt that structure, all you need is to create a novel tag that won't get confused with any existing content, and use that as the one element that will contain everything else being put into the new file. Just drop the initial <?xml...?> line from each input file. (I've seen a lot of "xml" files that don't start with that, so I think it's worthwhile to check.)
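
The first-line test assumes the declaration sits alone on line one. If a file has a blank line before it, the declaration would be copied through. If the declaration shares a line with the root tag, that tag would be dropped along with it. A minimal sketch of a more tolerant check, assuming the whole file has been read into a string first (the helper name strip_xml_declaration is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading XML declaration from a string.
# It tolerates a UTF-8 byte-order mark and whitespace in front of the
# declaration, and keeps whatever follows it on the same line.
sub strip_xml_declaration {
    my ($text) = @_;
    $text =~ s/\A(?:\xEF\xBB\xBF)?\s*<\?xml\b.*?\?>\s*//s;
    return $text;
}

# Declaration and root tag on one line: only the declaration is removed.
print strip_xml_declaration(
    qq{<?xml version="1.0" encoding="UTF-8"?> <shiporder>\n}
);
```

Text that has no leading declaration is returned unchanged.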

Other things I changed in the code were:

  • fixed how @ARGV is handled -- don't put "my" in front of that, and use it the way it was meant to be used.
  • removed modules you didn't really need (Data::Dumper, XML::LibXML, File::Spec::Functions)
  • removed variables and statements you didn't need (using $_ more)
  • (update:) only use -f on items whose path/file names end in ".xml" (could probably skip -f altogether)
Now, this doesn't handle the problem of removing duplicate xml content, but that's something that will be a lot easier to do after you've written the one big xml file. That's where a good parsing module (like XML::LibXML) will come in very handy.

If your duplication problem is really just a matter of the (exact) same xml content showing up in multiple files (e.g. "foo1.xml" is a copy of "foo2.xml", or "blah1/foo.xml" is a copy of "blah2/foo.xml"), you can simply get md5 signatures of all the files first, sort by md5 values, and look for duplicates that way (files with identical content will have identical md5 values).
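
A rough sketch of that whole-file check, using only the core modules File::Find and Digest::MD5. The subroutine name xml_files_by_md5 is made up for illustration, and the directories to scan are taken from the command line:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Digest::MD5;

# Return a hash ref that maps each MD5 hex digest to the list of
# *.xml files (under the given directories) having that digest.
sub xml_files_by_md5 {
    my @dirs = @_;
    my %files_for;
    find(
        {
            no_chdir => 1,    # $_ holds the full path of each item
            wanted   => sub {
                return unless /[.]xml\z/i and -f;
                open( my $fh, '<:raw', $_ ) or do {
                    warn "can't read $_: $!\n";
                    return;
                };
                my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
                close $fh;
                push @{ $files_for{$digest} }, $_;
            },
        },
        @dirs
    );
    return \%files_for;
}

# Report every group of files that share a digest.
if (@ARGV) {
    my $files_for = xml_files_by_md5(@ARGV);
    for my $digest ( sort keys %$files_for ) {
        my @same = sort @{ $files_for->{$digest} };
        print "$digest: @same\n" if @same > 1;
    }
}
```

Each printed line lists files with byte-for-byte identical content, so all but one file in each group can be left out of the merge.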

But if the duplication problem involves elements that make up parts of files, then a parser is the only way to go, and you'll need to know enough about the data to figure out which elements need to be checked for duplicate content. If you know which tags to look at, running a parser on the "all-combined" xml will make it easy to find and remove the duplicates.
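
For that element-level case, here is a minimal sketch with XML::LibXML. It assumes two elements count as duplicates when one particular child tag has the same text. The choice of "name" as that key tag, the subroutine name drop_duplicates_by_tag, and the sample data are all made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Remove every $elem_name element whose $key_tag child text has
# already been seen; the first occurrence of each key is kept.
sub drop_duplicates_by_tag {
    my ( $doc, $elem_name, $key_tag ) = @_;
    my %seen;
    for my $elem ( $doc->findnodes("//$elem_name") ) {
        my $key = $elem->findvalue("./$key_tag");
        $elem->unbindNode() if $seen{$key}++;
    }
    return $doc;
}

# Invented sample: two "shipto" elements share the same "name".
my $xml = <<'XML';
<shiporder>
<shipto><name>johan</name><address>Langgt 23</address></shipto>
<shipto><name>johan</name><address>somewhere else</address></shipto>
<shipto><name>benny</name><address>galve 23</address></shipto>
</shiporder>
XML

my $doc = XML::LibXML->load_xml( string => $xml );
drop_duplicates_by_tag( $doc, 'shipto', 'name' );
print $doc->toString();
```

For the combined file, load_xml( location => ... ) could be used in place of the string. Which tag identifies a duplicate depends on the data.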


Re^2: Multiple XML files from Directory to One XML file using perl.
by jyo (Initiate) on Nov 21, 2011 at 09:34 UTC

    Hi graff, thanks for your reply. I tried your code. It runs, but it prints every XML file in full into the one output file, like this:

    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <shipto>
        <name>johan</name>
        <address>Langgt 23</address>
        <------more info-------->
      </shipto>
    </shiporder>
    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <shipto>
        <name>benny</name>
        <address>galve 23</address>
        <------more info-------->
      </shipto>
    </shiporder>
    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <shipto>
        <name>kent</name>
        <address>vadrss 25</address>
        <------more info-------->
      </shipto>
    </shiporder>

    How can I eliminate these tags?

    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">

    They are printed once for every input file in the final XML file. Can you tell me how I could eliminate them and print it like this?

    <?xml version="1.0" encoding="UTF-8"?>
    <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <shipto>
        <name>johan</name>
        <address>Langgt 23</address>
        <------more info-------->
      </shipto>
      <shipto>
        <name>benny</name>
        <address>galve 23</address>
        <------more info-------->
      </shipto>
      <shipto>
        <name>kent</name>
        <address>vadrss 25</address>
        <------more info-------->
      </shipto>
    </shiporder>

    Because every XML file starts with the same tag, we need to eliminate the repeats so that the final XML contains that tag only once. Can you please help me?

      First, I don't understand why the <?xml ...?> lines from all the input files are being included in the single output file -- when I use my code as posted, it removes those from each input. Either you're running something different from what I posted, or else there's something odd about the <?xml... lines in your data files.

      As for what you really want, which is one <shiporder ...> element containing all the content of all the files (that is, combining the "shipto" elements from all the input files into one "shiporder"), that's a different plan from what I was suggesting, and it would be best to use a parser for that.

      In fact, it seems like the OP code is really pretty close to what you want. Here's my version, with Digest::MD5 thrown in to eliminate duplicate "shipto" content:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Carp;
      use File::Find;
      use XML::LibXML::Reader;
      use Digest::MD5 'md5';

      if ( @ARGV == 0 ) {
          push @ARGV, "C:/file/dir";
          warn "Using default path $ARGV[0]\n Usage: $0 path ...\n";
      }

      # open an output file whose name won't be found by File::Find
      open( my $allxml, '>', "all_shiporders.xml.combined" )
          or die "can't open output xml file for writing: $!\n";

      print $allxml '<?xml version="1.0" encoding="UTF-8"?>',
          "\n<shiporder xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">\n";

      my %shipto_md5;    # digests of the "shipto" elements already written

      find(
          sub {
              return unless ( /[.]xml\z/i and -f );
              extract_information();
              return;
          },
          @ARGV
      );

      print $allxml "</shiporder>\n";

      sub extract_information {
          my $path = $_;
          if ( my $reader = XML::LibXML::Reader->new( location => $path )) {
              while ( $reader->nextElement( 'shipto' )) {
                  my $elem = $reader->readOuterXml();
                  my $md5 = md5( $elem );
                  # write each distinct "shipto" element only once
                  print $allxml $elem unless ( $shipto_md5{$md5}++ );
              }
          }
          return;
      }

      That seems to work on a set of files such as the following, leaving out "j4.xml" because it's identical to "j2.xml":

      Here's the output -- the only difference between this and what you wanted is the absence vs. presence of extra line-feeds around the "shipto" tags, which is just a matter of cosmetics:

      <?xml version="1.0" encoding="UTF-8"?>
      <shiporder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <shipto>
        <name>johan</name>
        <address>Langgt 23</address>
      </shipto><shipto>
        <name>benny</name>
        <address>galve 23</address>
      </shipto><shipto>
        <name>kent</name>
        <address>vadrss 25</address>
      </shipto><shipto>
        <name>stewart</name>
        <address>vadrss 25</address>
      </shipto></shiporder>

        Hi, in this script MD5 is of no use, because I don't have repeated nodes with the same content. I have repeated nodes that share one tag element. Can you help me remove a node's information by searching on that tag element? Please help me with this problem; I am not able to implement the logic.
