PerlMonks  

Light batch XML indenter

by diotalevi (Canon)
on May 28, 2003 at 13:31 UTC ( [id://261292]=CUFP )

This is a dirt-basic XML indenter. I wrote it so I could take a directory full of XML files and add newlines after tags and leading whitespace. I made no attempt to do anything comprehensive with this since it handles the simple case I was looking at.
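
The batch part of the job (glob a directory, slurp each file, write it back in place) can be sketched separately from the indenting itself. This is only a sketch under assumptions: `rewrite_files` and its callback argument are hypothetical names, not part of the script, and it uses lexical filehandles with a scoped `local $/` where the script uses bareword handles and a global `undef $/`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): apply $transform to the
# full contents of every file matching $pattern and write the result back.
sub rewrite_files {
    my ($pattern, $transform) = @_;
    my @done;

    for my $file (glob $pattern) {
        # Slurp the whole file. "local $/" keeps slurp mode scoped to this block.
        my $text = do {
            local $/;
            open my $in, '<', $file
                or die "Couldn't open $file for reading: $!";
            <$in>;
        };

        my $new = $transform->($text);

        open my $out, '>', $file
            or die "Couldn't open $file for writing: $!";
        print {$out} $new or die "Couldn't write to $file: $!";
        close $out        or die "Couldn't close $file: $!";

        push @done, $file;
    }

    return @done;    # names of the files that were rewritten
}
```

Called as `rewrite_files("*.xml", sub { my $t = shift; $t =~ s/(?<=>)\s+(?=<)//g; $t })`, it would strip inter-tag whitespace from every XML file in the current directory. Like the script, it overwrites files with no backup.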

The simple (and white-space free) document

<message><org><cn>Some org-or-other</cn><ph>Wouldn't you like to know</ph></org><contact><fn>Pat</fn><ln>Califia</ln></contact></message>

Becomes something nicer to look at

<message>
  <org>
    <cn>Some org-or-other</cn>
    <ph>Wouldn't you like to know</ph>
  </org>
  <contact>
    <fn>Pat</fn>
    <ln>Califia</ln>
  </contact>
</message>

@files = glob "*.xml";
undef $/;

for $file (@files) {
    $indent = 0;

    open FILE, $file or die "Couldn't open $file for reading: $!";
    $_ = readline *FILE;
    close FILE or die "Couldn't close $file: $!";

    # Remove whitespace between > and < if that is the only thing separating
    # them
    s/(?<=>)\s+(?=<)//g;

    # Indent
    s{
        # Capture a tag <$1$2$3>,
        #   a potential closing slash $1
        #   the contents              $2
        #   a potential closing slash $3
        <(/?)([^/>]+)(/?)>

        # Optional white space
        \s*

        # Optional tag.
        # $4 contains either undef, "<" or "</"
        (?=(</?))?
    }
    {
        # Adjust the indentation level.
        # $3: A <foo/> tag. No alteration to indentation.
        # $1: A closing </foo> tag. Drop one indentation level
        # else: An opening <foo> tag. Increase one indentation level
        $indent += $3 ? 0 : $1 ? -1 : 1;

        # Put the captured tag back into place
        "<$1$2$3>" .
        # Two closing tags in a row. Add a newline and indent the next line.
        # This uses && rather than "and": "and" binds more loosely than ?:,
        # so with "and" an opening tag would never get a newline after it.
        ($1 && ($4 eq "</") ? "\n" . (" " x $indent) :
        # This isn't a closing tag but the next tag is. Add a newline and
        # indent the next line.
        $4 ? "\n" . (" " x $indent) :
        # This isn't a closing tag - no special indentation. I forget why
        # this works.
        "" )
        # /g repeat as necessary
        # /e Execute the block of perl code to create replacement text
        # /x Allow whitespace and comments in the regex
    }gex;

    open FILE, ">", $file or die "Couldn't open $file for writing: $!";
    print FILE or die "Couldn't write to $file: $!";
    close FILE or die "Couldn't close $file: $!";
}

__END__

This is the version I copied and pasted from the working script. It's ugly, so I purtied it up and posted the version you see above this text. I'm leaving this ugly version here just in case I introduced some bug I'm not aware of.

@files = glob "*.xml";
undef $/;
$tag = "<>/";
for $file (@files) {
    $indent = 0;
    open FILE, $file;
    $_ = <FILE>;
    s/(?<=>)\s+(?=<)//g;
    s(<(/?)([^/>]+)(/?)>\s*(?=(</?))?)($indent+=$3?0:$1?-1:1;"<$1$2$3>".($1&&($4 eq"</")?"\n".(" "x$indent):$4?"\n".(" "x$indent):""))ge;
    open FILE, ">$file";
    print FILE;
}
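
For comparison, here is a minimal sketch of the same job done with a token loop instead of one big substitution. It is a variation, not the script above: `indent_xml` and its `$pad` argument are made-up names, it decides the indentation of each tag when that tag is reached rather than when the previous one is emitted, and it assumes simple input (no comments or CDATA sections containing `>`).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): indent an XML string by
# walking its tags and text runs one at a time and tracking nesting depth.
sub indent_xml {
    my ($xml, $pad) = @_;
    $pad = "  " unless defined $pad;

    # Same first step as the script: drop whitespace-only runs between tags.
    $xml =~ s/(?<=>)\s+(?=<)//g;

    my $depth = 0;
    my $out   = '';
    my $prev  = '';    # kind of the previous token; '' before the first one

    # Each token is either a tag (<...>) or a run of text between tags.
    while ($xml =~ /\G(<[^>]+>|[^<]+)/gc) {
        my $tok  = $1;
        my $kind = $tok !~ /^</                             ? 'text'
                 : $tok =~ m{^</}                           ? 'close'
                 : $tok =~ m{^<[?!]} || $tok =~ m{/>\z}     ? 'empty'
                 :                                            'open';

        # A closing tag sits one level shallower than its children.
        $depth-- if $kind eq 'close';

        # Start a new line before a tag, unless it is the first token or it
        # directly follows text (so <cn>text</cn> stays on one line).
        if ($kind ne 'text' && $prev ne '' && $prev ne 'text') {
            $out .= "\n" . ($pad x $depth);
        }

        $out .= $tok;
        $depth++ if $kind eq 'open';
        $prev = $kind;
    }

    return $out;
}
```

With this approach a closing tag lines up with its opening tag because the depth is adjusted before the line is started. It still only ever adds whitespace between adjacent tags, never inside text.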

Replies are listed 'Best First'.
•Re: Light batch XML indenter
by merlyn (Sage) on May 29, 2003 at 16:25 UTC
Re: Light batch XML indenter
by vek (Prior) on May 28, 2003 at 14:15 UTC
    diotalevi++

    You've saved me from working on something similar today - thanks. We received a bunch of XML that has no indentation or whitespace that I'll need to 'clean up'. I'll be taking your code out for a spin when I get to work today.

    Thanks again!

    -- vek --

      This is as good a time as any to mention it: you only want to do this if whitespace nodes between tags aren't important to you. In my case I don't care about the introduced whitespace as data, and it made some debugging easier. This isn't always a valid approach, so keep that in mind.
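
A small illustration of that caveat, using a made-up mixed-content fragment (the `<p>` markup is an assumption, not from the thread). The cleanup step treats every whitespace-only run between `>` and `<` as disposable, so a space that separates two inline elements is lost:

```perl
use strict;
use warnings;

# Hypothetical mixed-content input: the single space between </b> and <i>
# is real data here, not formatting.
my $mixed = '<p><b>Pat</b> <i>Califia</i></p>';

# The same whitespace-stripping substitution the script applies.
(my $stripped = $mixed) =~ s/(?<=>)\s+(?=<)//g;

print "$stripped\n";    # prints <p><b>Pat</b><i>Califia</i></p>
```

The two words now run together when rendered, which is why this kind of indenter is only safe for data-style XML.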

Re: Light batch XML indenter
by mirod (Canon) on Jun 02, 2003 at 19:44 UTC
Re: Light batch XML indenter ( rewritten with Regexp::NamedCaptures )
by diotalevi (Canon) on Sep 19, 2005 at 17:21 UTC

    Just to see if this made things more readable, I substituted all the positional captures with named captures. See Regexp::NamedCaptures, at last for the module code.

    use Regexp::NamedCaptures;

    @files = glob "*.xml";
    undef $/;

    for $file (@files) {
        $indent = 0;

        open FILE, $file or die "Couldn't open $file for reading: $!";
        $_ = readline *FILE;
        close FILE or die "Couldn't close $file: $!";

        # Remove whitespace between > and < if that is the only thing separating
        # them
        s/(?<=>)\s+(?=<)//g;

        # Indent
        s{
            # Capture a tag <$close_tag$name$empty_tag>,
            #   a potential closing slash $close_tag
            #   the contents              $name
            #   a potential closing slash $empty_tag
            <(?<\$close_tag>/?)(?<\$name>[^/>]+)(?<\$empty_tag>/?)>

            # Optional white space
            \s*

            # Optional tag.
            # $next_tag_start contains either undef, "<" or "</"
            (?=(?<\$next_tag_start></?))?
        }
        {
            # Adjust the indentation level.
            # $empty_tag: A <foo/> tag. No alteration to indentation.
            # $close_tag: A closing </foo> tag. Drop one indentation level
            # else: An opening <foo> tag. Increase one indentation level
            $indent += $empty_tag ? 0 : $close_tag ? -1 : 1;

            # Put the captured tag back into place
            "<$close_tag$name$empty_tag>" .
            # Two closing tags in a row. Add a newline and indent the next line.
            # This uses && rather than "and": "and" binds more loosely than ?:,
            # so with "and" an opening tag would never get a newline after it.
            ($close_tag && ($next_tag_start eq "</") ? "\n" . (" " x $indent) :
            # This isn't a closing tag but the next tag is. Add a newline and
            # indent the next line.
            $next_tag_start ? "\n" . (" " x $indent) :
            # This isn't a closing tag - no special indentation. I forget why
            # this works.
            "" )
            # /g repeat as necessary
            # /e Execute the block of perl code to create replacement text
            # /x Allow whitespace and comments in the regex
        }gex;

        open FILE, ">", $file or die "Couldn't open $file for writing: $!";
        print FILE or die "Couldn't write to $file: $!";
        close FILE or die "Couldn't close $file: $!";
    }
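
Perl 5.10 and later also support named captures natively, written `(?<name>...)` with the results available in `%+`, so the same readability experiment can be tried without a module. A minimal sketch of just the tag-matching part follows; `classify_tag` and `$tag_re` are hypothetical names, and the pattern keeps the original's limitation that a `/` anywhere inside the tag body (for example in an attribute value) makes it fail to match.

```perl
use strict;
use warnings;

# The tag pattern from the post, rewritten with native named captures.
my $tag_re = qr{
    <
    (?<close_tag> /?      )   # a leading slash marks a closing tag
    (?<name>      [^/>]+  )   # the contents of the tag
    (?<empty_tag> /?      )   # a trailing slash marks an empty tag
    >
}x;

# Hypothetical helper: report which kind of tag a string is,
# or return nothing if it is not a single tag.
sub classify_tag {
    my ($tag) = @_;
    return unless $tag =~ /\A$tag_re\z/;
    return $+{empty_tag} ? 'empty'
         : $+{close_tag} ? 'close'
         :                 'open';
}

print classify_tag('<fn>'),  "\n";   # open
print classify_tag('</fn>'), "\n";   # close
print classify_tag('<fn/>'), "\n";   # empty
```

The three kinds map directly onto the `0`, `-1` and `+1` adjustments to `$indent` in the code above.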
