
compacting XML?

by Anonymous Monk
on Nov 30, 2011 at 02:22 UTC (#940754)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I have XML in human readable format, eg:

<outermost>
  <innermost>
    <first>1</first>
    <second/>
    <third>These spaces are to be preserved.</third>
  </innermost>
</outermost>
...which I would like to compact to something similar to the following:
<outermost><innermost><first>1</first><second/> <third>These spaces are to be preserved.</third> </innermost></outermost>
Only whitespace occurring within values needs to be preserved. Any whitespace which only separates tags from other tags can be removed.

I naively came up with the following regular expressions, which take care of most possibilities:

# <begin> <begin>
$xml =~ s!(<\S+?>)\s+(<\S+?>)!$1$2!g;
# </end> <begin>
$xml =~ s!(</\S+?>)\s+(<\S+?>)!$1$2!g;
...but while I could easily see that there are more possibilities to consider (and these are not completely robust either...), I doubt I am the first to run into this problem.

Is there a CPAN module or canned solution which deals with compacting XML?

Thanks.

Re: compacting XML?
by GrandFather (Cardinal) on Nov 30, 2011 at 02:38 UTC

    The "kitchen sink" module for manipulating XML is XML::Twig. Consider:

    use strict;
    use warnings;
    use XML::Twig;

    my $xml = <<XML;
    <outermost>
      <innermost>
        <first>1</first>
        <second/>
        <third>These spaces are to be preserved.</third>
      </innermost>
    </outermost>
    XML

    my $twig = XML::Twig->new();
    $twig->parse($xml);
    $twig->print();

    Prints:

    <outermost><innermost><first>1</first><second/><third>These spaces are to be preserved.</third></innermost></outermost>
    True laziness is hard work
Re: compacting XML?
by ikegami (Pope) on Nov 30, 2011 at 03:29 UTC

    The other kitchen sink, XML::LibXML.

    use strict;
    use warnings;
    use XML::LibXML qw( );

    my $xml = <<'XML';
    <outermost>
      <innermost>
        <first>1</first>
        <second/>
        <third>These spaces are to be preserved.</third>
      </innermost>
    </outermost>
    XML

    my $doc = XML::LibXML->load_xml( string => $xml, no_blanks => 1 );
    print $doc->toString();
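As a variation on the snippet above, here is a minimal sketch that serializes only the root element, so the XML declaration that `$doc->toString()` normally emits is left out. The sub name `compact_xml` is made up for illustration; `load_xml`, `no_blanks`, `documentElement` and `toString` are the XML::LibXML calls.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

# compact_xml is a hypothetical helper name, not part of XML::LibXML.
sub compact_xml {
    my ($xml) = @_;

    # no_blanks asks the parser to drop whitespace-only text between elements.
    my $doc = XML::LibXML->load_xml( string => $xml, no_blanks => 1 );

    # Serializing the root element (not the document) skips the declaration.
    return $doc->documentElement->toString();
}

my $xml = <<'XML';
<outermost>
  <innermost>
    <first>1</first>
    <second/>
    <third>These spaces are to be preserved.</third>
  </innermost>
</outermost>
XML

print compact_xml($xml), "\n";
```

Note that `no_blanks` relies on a heuristic in the underlying parser, so it is worth checking the result on documents with mixed content.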
Re: compacting XML?
by sundialsvc4 (Monsignor) on Nov 30, 2011 at 04:10 UTC

    Uh huh, and with either package the approach is the same: let some XML-knowledgeable parser hand you the elements one after another in printable form, then output those elements in printable form (as the aforesaid XML-knowledgeable parser has conveniently handed them to you ...) with nothing in-between them. Instead of monkeying around with regular expressions that attempt to treat the XML as a text-string, you delegate the entire XML-parsing task to, as it were, “someone who knows.™” Works.++

Re: compacting XML?
by Anonymous Monk on Nov 30, 2011 at 05:03 UTC
    If you happen to be using non-standard tag delimiters, like say (tag)data(/tag) or [tag]data[/tag], you would need to use some sort of state machine. I think I might have a play with writing one of these later.
      Why on earth would you want to use such delimiters?
Re: compacting XML?
by TJPride (Pilgrim) on Nov 30, 2011 at 08:32 UTC
    This should work as long as you don't care about retaining empty strings (<tag> </tag>) and as long as the content itself doesn't contain the < character. A good XML parser will of course also work, but if your data is simple, there may be no need.

    use strict;
    use warnings;

    my $xml = join '', <DATA>;

    while ($xml =~ s/(<\/?.*?>)\s+(<\/?.*?>)/$1$2/sg) {}

    print $xml;

    __DATA__
    <outermost>
      <innermost>
        <first>1</first>
        <second/>
        <third>These spaces are to be preserved.</third>
      </innermost>
    </outermost>
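To make the two caveats above concrete, here is a small sketch that runs the same substitution on two inputs where it changes more than separator whitespace. The sub name `compact_with_regex` and both sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Same substitution as in the post above, wrapped in a sub for reuse.
# compact_with_regex is a hypothetical name.
sub compact_with_regex {
    my ($xml) = @_;
    1 while $xml =~ s/(<\/?.*?>)\s+(<\/?.*?>)/$1$2/sg;
    return $xml;
}

# Case 1: a value consisting only of whitespace is emptied.
print compact_with_regex('<tag> </tag>'), "\n";
# prints: <tag></tag>

# Case 2: text inside a CDATA section that merely looks like tags is altered.
print compact_with_regex('<a><![CDATA[<b> <c>]]></a>'), "\n";
# prints: <a><![CDATA[<b><c>]]></a>
```

If the data can contain CDATA sections, comments, or whitespace-only values that matter, a parser-based approach avoids both surprises.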
Re: compacting XML?
by choroba (Abbot) on Nov 30, 2011 at 08:58 UTC
    I usually use XML::XSH2 for XML manipulation. This code removes all text nodes that contain only whitespace (which is not exactly the same as your specification), but the output is the same as that of the other solutions.
    for //text() { if xsh:match(., '^\s+$') set . '' }
    Update: This should delete only whitespace that has an element sibling:
    for //text() { if ((following-sibling::* or preceding-sibling::*) and xsh:match(., '^\s+$')) set . '' }
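For readers without XSH2, roughly the same rule can be sketched in plain Perl with XML::LibXML: select whitespace-only text nodes that have an element sibling and detach them. The sub name `strip_separator_whitespace` and the sample input are invented here, and the XPath uses `normalize-space()` in place of `xsh:match`, so "whitespace" means space, tab, CR and LF only.

```perl
use strict;
use warnings;
use XML::LibXML qw( );

# strip_separator_whitespace is a hypothetical helper name.
sub strip_separator_whitespace {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml( string => $xml );

    # Whitespace-only text nodes that sit next to at least one element.
    my $xpath = q{//text()[normalize-space(.) = ''}
              . q{ and (following-sibling::* or preceding-sibling::*)]};

    # Detach each matching node from the tree.
    $_->unbindNode() for $doc->findnodes($xpath);

    return $doc->documentElement->toString();
}

print strip_separator_whitespace('<r> <a> </a> <b>x</b> </r>'), "\n";
# prints: <r><a> </a><b>x</b></r>
```

The blank inside `<a> </a>` survives because that text node has no element sibling, which is the difference from the first XSH2 one-liner.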
Re: compacting XML?
by Lotus1 (Chaplain) on Nov 30, 2011 at 17:46 UTC

    XML::Tidy has a function, strip(), that does exactly what you have requested.

    strip()

    The strip() member function searches the Tidy object for all mixed-content (i.e., non-data) text nodes && empties them out. This will basically unformat any markup indenting.

    strip() is used by compress() && tidy() but it is exposed because it could be worthwhile by itself.
