PerlMonks  

xml_split - split huge XML documents into smaller chunks

by mirod (Canon)
on Feb 10, 2005 at 13:36 UTC ( [id://429707]=CUFP )

It looks like the question arises frequently here, so here is my attempt at solving that problem, with a tool that tries not to use too much memory.

This code requires XML::Twig 3.16 and above (available from The XML::Twig page at the time of this writing). It will be bundled with XML::Twig 3.16 and above, so do not look for improved versions here.

Usage is quite simple: xml_split foo.xml will generate foo-01.xml..foo-nn.xml, 1 file per child of the document root, plus foo-00.xml, which contains just the root and one processing instruction per generated file, so that xml_merge can rebuild the original document. See other fascinating options by doing xml_split -m.
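As a rough illustration of the naming scheme, here is a minimal sketch that mirrors what the file_name sub in the script does. The name part_name and its argument order are made up for this example; only the "<base>-<nb><ext>" pattern comes from the tool.

```perl
use strict;
use warnings;

# Hypothetical helper, modelled on file_name() in the script:
# builds "<base>-<nb><ext>", zero-padding <nb> to at least $nb_digits digits.
sub part_name {
    my( $base, $ext, $nb_digits, $seq_nb)= @_;
    my $nb= sprintf( "%0${nb_digits}d", $seq_nb);
    return "$base-$nb$ext";
}

# prints foo-00.xml, foo-01.xml, foo-12.xml and foo-112.xml, one per line
print part_name( 'foo', '.xml', 2, $_), "\n" for 0, 1, 12, 112;
```

Sequence number 0 is the main file; when more digits are needed than requested, sprintf simply uses them.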

Once you've processed the various parts of the original document, you might want to merge them all back, in which case you might want to have a look at xml_merge.

#!/usr/bin/perl -w
# $Id: xml_split,v 1.5 2005/02/10 11:45:07 mrodrigu Exp $
use strict;

use XML::Twig;
use FindBin qw( $RealBin $RealScript);
use Getopt::Std;

$Getopt::Std::STANDARD_HELP_VERSION=1; # to stop processing after --help or --version

use vars qw( $VERSION $USAGE);

$VERSION= "0.02";
$USAGE= "xml_split [-l <level> | -c <cond>] [-b <base>] [-n <nb>] [-e <ext>] [-d] [-v] [-h] [-m] [-V] <files>\n";

{ # main block

    my $opt={};
    getopts('l:c:b:n:e:dvhmV', $opt);

    # defaults
    $opt->{n} ||= 2; # number of digits used for creating parts

    if( $opt->{h}) { die $USAGE, "\n"; }
    if( $opt->{m}) { exec "pod2text $RealBin/$RealScript"; }
    if( $opt->{V}) { print "xml_split version $VERSION\n"; exit; }

    if( $opt->{c}) {
        die "cannot use --level and --condition at the same time\n" if( $opt->{l});
    }
    else {
        $opt->{l} ||= 1;
        $opt->{c}= "level( $opt->{l})";
    }

    my $options= { cond      => $opt->{c},
                   base      => $opt->{b},
                   nb_digits => $opt->{n},
                   ext       => $opt->{e},
                   verbose   => $opt->{v},
                   no_pi     => $opt->{d}
                 };

    my $state;
    $state->{seq_nb}=0;

    if( !@ARGV) {
        # no file given: read the document from the standard input
        $options->{base} ||= 'out';
        $options->{ext}  ||= '.xml';
        my $twig_options= twig_options( $options, $state);
        my $t= XML::Twig->new( %$twig_options);
        $t->parse( \*STDIN);
        end_file( $t, $options, $state);
    }
    else {
        foreach my $file (@ARGV) {
            unless( $options->{base}) { $state->{seq_nb}=0; }

            my( $base, $ext)= ($file=~ m{^(.*?)(\.\w+)?$});
            $options->{base} ||= $base;
            $options->{ext}  ||= $ext || '.xml';

            my $twig_options= twig_options( $options, $state);
            my $t= XML::Twig->new( %$twig_options);
            $t->parsefile( $file);
            end_file( $t, $options, $state);
        }
    }
}

sub twig_options {
    my( $tool_options, $state)= @_;

    # base options, ensures maximum fidelity to the original document
    my $twig_options= { keep_encoding => 1, keep_spaces => 1 };

    # prepare output to the main document
    unless( $tool_options->{no_pi}) {
        my $file_name= file_name( $tool_options, { %$state, seq_nb => 0} ); # main file name
        warn "generating main file $file_name\n" if( $tool_options->{verbose});
        open( my $out, '>', $file_name) or die "cannot create main file '$file_name': $!";
        $state->{out}= $out;
        $twig_options->{twig_print_outside_roots}= $out;
        $twig_options->{start_tag_handlers}= { $tool_options->{cond} => sub { $_->set_att( '#in_fragment' => 1); } };
    }

    $twig_options->{twig_roots}= { $tool_options->{cond} => sub { dump_elt( @_, $tool_options, $state); } };

    return $twig_options;
}

sub dump_elt {
    my( $t, $elt, $options, $state)= @_;

    $state->{seq_nb}++;
    my $file_name= file_name( $options, $state);
    warn "generating $file_name\n" if( $options->{verbose});

    # the fragment inherits the XML declaration and DTD of the original document
    my $fragment= XML::Twig->new();
    $fragment->{twig_xmldecl} = $t->{twig_xmldecl};
    $fragment->{twig_doctype} = $t->{twig_doctype};
    $fragment->{twig_dtd}     = $t->{twig_dtd};

    if( !$options->{no_pi}) {
        # if we are still within a fragment, just replace the element by the PI
        # otherwise print it to the main document
        my $subdocs= $elt->att( '#has_subdocs') || 0;
        my $pi= XML::Twig::Elt->new( '#PI')
                              ->set_pi( merge => " subdocs = $subdocs :$file_name");
        $elt->del_att( '#in_fragment');

        if( $elt->inherited_att( '#in_fragment')) {
            $elt->parent( '*[@#in_fragment="1"]')->set_att( '#has_subdocs' => 1);
            $pi->replace( $elt);
        }
        else {
            $elt->cut;
            $pi->print( $state->{out});
        }
    }
    else {
        $elt->cut;
    }

    $fragment->set_root( $elt);
    open( my $out, '>', $file_name) or die "cannot create output file '$file_name': $!";
    $fragment->print( $out);
    close $out;
}

sub end_file {
    my( $t, $options, $state)= @_;
    unless( $options->{no_pi}) { close $state->{out}; }
}

sub file_name {
    my( $options, $state)= @_;
    my $nb= sprintf( "%0$options->{nb_digits}d", $state->{seq_nb});
    my $file_name= "$options->{base}-$nb$options->{ext}";
    return $file_name;
}

# for Getopt::Std
sub HELP_MESSAGE    { return $USAGE;   }
sub VERSION_MESSAGE { return $VERSION; }

__END__

=head1 NAME

xml_split - cut a big XML file into smaller chunks

=head1 DESCRIPTION

C<xml_split> takes a (presumably big) XML file and splits it into several
smaller files. The memory used is the memory needed for the biggest chunk
(ie memory is reused for each new chunk).

It can split at a given level in the tree (the default, splits children
of the root), or on a condition (using the subset of XPath understood by
XML::Twig, so C<section> or C</doc/section>).

Each generated file is replaced by a processing instruction that will allow
C<xml_merge> to rebuild the original document. The processing instruction
format is C<< <?merge subdocs=[01] :<filename> ?> >>

File names are <file>-<nb>.xml, with <file>-00.xml holding the main document.

=head1 OPTIONS

=over 4

=item -l <level>

level to cut at: 1 generates a file for each child of the root, 2 for each
grand child

defaults to 1

=item -c <condition>

generate a file for each element that passes the condition

xml_split -c <section> will put each C<section> element in its own file
(nested sections are handled too)

=item -b <name>

base name for the output, files will be named <base>-<nb><.ext>

<nb> is a sequence number, see below C<--nb_digits>
<ext> is an extension, see below C<--extension>

defaults to the original file name (if available) or C<out> (if input comes
from the standard input)

=item -n <nb>

number of digits in the sequence number for each file

if more digits than <nb> are needed, then they are used: if C<--nb_digits 2>
is used and 112 files are generated they will be named C<< <file>-01.xml >>
to C<< <file>-112.xml >>

defaults to 2

=item -e <ext>

extension to use for generated files

defaults to the original file extension or C<.xml>

=item -v

verbose output

=item -V

outputs version and exit

=item -h

short help

=item -m

man (requires pod2text to be in the path)

=back

=head1 EXAMPLES

  xml_split foo.xml             # split at level 1
  xml_split -l 2 foo.xml        # split at level 2
  xml_split -c section foo.xml  # a file is generated for each section element
                                # nested sections are split properly

=head1 SEE ALSO

XML::Twig, xml_merge

=head1 TODO

=over 4

=item test

At the moment this is really alpha code, tested only on small, simple
documents.

It would be a good idea to first check that indeed the whole document is
not loaded in memory!

=item optimize the code

any idea welcome! I have already implemented most of what I thought would
improve performances.

=item provide other methods than PIs to keep merge information

XInclude is a good candidate.

using entities, which would seem the natural way to do it, doesn't work,
as they make it impossible to have both the main document and the sub docs
to be well-formed if the sub docs include sub-sub docs (you can't have
entity declarations in an entity)

=back

=head1 AUTHOR

Michel Rodriguez <mirod@cpan.org>

=head1 LICENSE

This tool is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
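To make the processing instruction format from the POD a little more concrete, here is a minimal sketch that reads one back with a regular expression. It is not taken from xml_merge: parse_merge_pi is a hypothetical helper, the exact whitespace emitted around the PI data is an assumption, and file names containing spaces are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the subdocs flag and the file name from a
# merge PI such as '<?merge subdocs = 0 :foo-01.xml?>'.
# Returns an empty list when the string does not contain a merge PI.
sub parse_merge_pi {
    my( $pi)= @_;
    if( $pi=~ m{<\?merge\s+subdocs\s*=\s*([01])\s*:(\S+?)\s*\?>}) {
        return ( $1, $2);
    }
    return;
}

my( $subdocs, $file)= parse_merge_pi( '<?merge subdocs = 0 :foo-01.xml?>');
print "subdocs=$subdocs file=$file\n";
```

The subdocs flag tells a merging tool whether the referenced file itself contains further merge PIs (the nested sections case).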

Replies are listed 'Best First'.
Re: xml_split - split huge XML documents into smaller chunks
by grantm (Parson) on Feb 11, 2005 at 08:11 UTC

      OK, I get it, let me run the tests on all my machines here and I will upload 3.16 later today ;--)

      BTW if someone could test it on Windows, I would appreciate it, as I don't have any Win32 machine for testing at the moment. If someone could also check Bad newline interpretation by XML-Twig on Windows, that would be even better.

      Thanks

        On splitting and merging on Windows: after merging, comparing the XML with the original one gave bad newline interpretation in the first file (foo-00.xml) when it is about 30Kb in size. The smaller chunks have no newline, but:

        from Twig :

        { <sometag></sometag> </end_tag_before_splitingone> }

        right one :

        { <sometag><![CDATA[]]></sometag> </end_tag_before_splitingone> }

        updated 2005-02-11 by mirod: added tags

      Hi, instead of putting each section into its own file, can I put the first 1000 sections in one file, the next 1000 in another file, and so on? When there are 16000 sections, so many files get created that it is hard to handle them.

        Sorry I did not see this follow-up. Yes you can. In recent versions of xml_split, the -g or the -s options should give you what you need:

          -s <size>
               generates files of (approximately) <size>. The 
               content of each file is enclosed 
               in a new element ("xml_split::root"), so it’s 
               well-formed XML.  The size can be given in bytes,
               Kb, Mb or Gb.
        
          -g <nb>
               groups <nb> elements in a single file. The content
               of each file is enclosed in a new element
               ("xml_split::root"), so it’s well-formed XML.
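To illustrate the idea behind grouping (not how xml_split implements it), here is a minimal sketch that takes already-serialized elements, groups them <nb> at a time, and wraps each group in a single element so that every chunk is well-formed. group_fragments and the wrapper name passed to it are hypothetical; the real tool chooses its own wrapper element, as described above.

```perl
use strict;
use warnings;

# Hypothetical helper: group serialized elements $nb at a time and wrap each
# group in a <$wrapper> element, so every chunk has a single root.
sub group_fragments {
    my( $nb, $wrapper, @fragments)= @_;
    die "group size must be positive\n" unless $nb > 0;
    my @chunks;
    while( my @group= splice( @fragments, 0, $nb)) {
        push @chunks, join( "\n", "<$wrapper>", @group, "</$wrapper>");
    }
    return @chunks;
}

# 3 sections grouped by 2 give 2 chunks
my @chunks= group_fragments( 2, 'wrapper', '<section/>', '<section/>', '<section/>');
print scalar( @chunks), " chunks\n";
```

With 16000 sections and a group size of 1000, this kind of grouping yields 16 files instead of 16000.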
        
Re: xml_split - split huge XML documents into smaller chunks
by Anonymous Monk on May 30, 2007 at 15:58 UTC
    I'm thinking about using this. Right now I've got a 1gig XML file with about 20,000 different "sections" that I would want to split into individual files. What I would like to be able to do is say split this XML file on this node, and name all of the generated XML files the id of that node...so here is an example...

    <Root>
      <section id=1>
      <section id=2>
      <section id=3>
    </Root>

    I would want the split to generate a 1.xml, 2.xml and 3.xml in this case. Probably a pretty easy modification, but I just thought I'd throw that out there for everyone.
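A minimal sketch of such a modification, using XML::Twig directly rather than patching xml_split. Everything here is an assumption made for illustration: the element is named section, the input is well-formed (attribute values quoted, unlike in the example above), and id_file_name and split_by_id are hypothetical names. No merge PIs are written, so xml_merge could not rebuild the original from these files.

```perl
use strict;
use warnings;

# Hypothetical helper: turn an id attribute into a safe file name.
sub id_file_name {
    my( $id, $ext)= @_;
    $ext= '.xml' unless defined $ext;
    ( my $safe= $id)=~ s{[^\w.-]}{_}g;   # replace characters unsafe in file names
    return "$safe$ext";
}

# Write each <section> of $file to a file named after its id attribute.
sub split_by_id {
    my( $file)= @_;
    require XML::Twig;                    # loaded only when actually splitting
    my $twig= XML::Twig->new(
        twig_roots => {
            section => sub {
                my( $t, $section)= @_;
                my $id= $section->att( 'id');
                return unless defined $id;        # skip sections without an id
                my $name= id_file_name( $id);
                open( my $out, '>', $name) or die "cannot create '$name': $!";
                $section->print( $out);
                close $out;
                $t->purge;                        # release memory used so far
            },
        },
    );
    $twig->parsefile( $file);
}

# usage: split_by_id( 'big.xml');
```

Because only the section elements are built (twig_roots) and the twig is purged after each one, memory use should stay close to the size of the largest section.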
