PerlMonks  

how to strip XML into Plain Text file

by dbrock (Sexton)
on Jan 25, 2005 at 21:33 UTC (#425049=perlquestion)

dbrock has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I have an XML file that I am trying to strip back into plain text. An excerpt of the XML data is listed below.

I am trying to use something like s/\<.+\>//; to remove all of the text contained within the < > brackets. The file does not seem to open correctly.

Any advice?



<?xml version="1.0" encoding="UTF-16"?> <joblog><job_log_version version="1.0"/> <header><filler> ====================================================================== </filler><server>Job server: computername </server><name>Job name: computername - Inc </name><start_time>Job started: Monday, December 27, 2004 at 2:53:38 PM </start_time><type>Job type: Backup </type><log_name>Job Log: BEX00164.xml </log_name><filler> ====================================================================== </filler></header><media_drive_and_media_info> Drive and media information from media mount: <robotic_library_name>Robotic Library Name: COMPAQ 1 </robotic_library_name><drive_name>Drive Name: COMPAQ 1 </drive_name><slot>Slot: 1 </slot><media_label>Media Label: DSW000 </media_label><media_guid>Media GUID: {431B03DE-1C49-11D4-B21C-00508BCA3A68} </media_guid><media_overwrite_date>Overwrite Protected Until: 1/30/2005 3:14:41 AM </media_overwrite_date><media_append_date>Appendable Until: 12/31/9999 12:00:00 AM </media_append_date><media_set_target>Targeted Media Set Name: Daily </media_set_target></media_drive_and_media_info><backup><filler> ====================================================================== </filler><title>Job Operation - Backup </title><append_or_overwrite>Media operation - append. </append_or_overwrite><compression>Hardware compression enabled. </compression><filler>



I want the output to look like this:

======================================================================
Job server: computername
Job name: computername - Inc
Job started: Monday, December 27, 2004 at 2:53:38 PM
Job type: Backup
Job Log: BEX00164.xml
======================================================================
Drive and media information from media mount:
Robotic Library Name: COMPAQ 1
Drive Name: COMPAQ 1
Slot: 1
Media Label: DSW000
Media GUID: {431B03DE-1C49-11D4-B21C-00508BCA3A68}
Overwrite Protected Until: 1/30/2005 3:14:41 AM
Appendable Until: 12/31/9999 12:00:00 AM
Targeted Media Set Name: Daily
======================================================================
Job Operation - Backup
Media operation - append.
Hardware compression enabled.
======================================================================


Thank you for any help...
DBrock...

Dozens of br tags throughout input/output examples replaced by a set of code tags, to eliminate issues with long-line horizontal scrolling, by davido.
Edit by castaway - HTML entities turned back into literal characters, to make sense with code tags

Replies are listed 'Best First'.
Re: how to strip XML into Plain Text file
by Aristotle (Chancellor) on Jan 25, 2005 at 22:27 UTC

    I'm turning into the resident XSLTmonk.

    Feed this stylesheet into whichever XSLT processor is handy, along with your file:

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text" encoding="us-ascii" />
    </xsl:stylesheet>

    This is minimal because in your case the XSLT processor defaults (recurse full tree, printing text nodes) are a perfect fit.

    For some Perl code driving an XSL transform, see my example elsewhere from earlier today: Re: MathML 2 ascii?. But if you can install those modules you probably have a commandline XSLT processor installed anyway.

    Makeshifts last the longest.

Re: how to strip XML into Plain Text file
by davido (Cardinal) on Jan 25, 2005 at 21:41 UTC

    Using XML::Simple you can load the XML into a Perl data structure. From that point, formatting it for output should be simple. I've used XML::Simple with the PerlMonks::Mechanized project, among other little projects, and have found it to be easy and pretty straightforward. DO read its documentation though, because some of its settings produce a more useful data structure, and these settings are not obvious without reading the docs.


    Dave

Re: how to strip XML into Plain Text file
by borisz (Canon) on Jan 25, 2005 at 23:59 UTC
    perl script.pl file.xml

    package MySAXHandler;
    use base 'XML::SAX::Base';

    sub characters {
        print $_[1]->{Data};
    }

    package main;
    use XML::SAX;

    XML::SAX::ParserFactory->parser( Handler => MySAXHandler->new )
        ->parse_uri(shift);
    Boris
Re: how to strip XML into Plain Text file
by sleepingsquirrel (Hermit) on Jan 26, 2005 at 00:21 UTC
    perl -p -e 's/<[^>]*>//g' <foo.xml
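
For comparison with the s/\<.+\>//; attempt in the question: `.+` is greedy, so on a file that is one long line it matches from the first `<` to the last `>` and removes nearly everything, while `[^>]*` stops at the first `>`. The sketch below shows the difference on a shortened, made-up sample in the shape of the posted log. It also assumes the file really is UTF-16, as its XML declaration says; if so, the raw bytes have to be decoded before a regex will see literal `<` characters. `strip_tags` is a hypothetical helper name, and only core modules are used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Shortened sample in the shape of the posted job log (made up for illustration).
my $xml = '<header><server>Job server: computername </server>'
        . '<type>Job type: Backup </type></header>';

# Greedy attempt from the question: .+ runs from the first '<' to the last '>'.
(my $greedy = $xml) =~ s/<.+>//;

# Negated character class: each match stops at the first '>' after a '<'.
sub strip_tags {
    my ($text) = @_;
    $text =~ s/<[^>]*>//g;
    return $text;
}
my $stripped = strip_tags($xml);

# Assumption: the real file is UTF-16, as its XML declaration claims.
# Simulate the on-disk bytes, then decode them before stripping.
my $bytes      = encode('UTF-16', $xml);    # BOM followed by big-endian code units
my $decoded    = decode('UTF-16', $bytes);  # the BOM tells decode the byte order
my $from_utf16 = strip_tags($decoded);

print "greedy:   '$greedy'\n";
print "stripped: '$stripped'\n";
print "utf-16:   '$from_utf16'\n";
```

When reading a real file rather than an in-memory string, the same decoding can be requested at open time with an I/O layer, for example `open my $fh, '<:encoding(UTF-16)', $path`, where `$path` stands in for the log file's name.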


    -- All code is 100% tested and functional unless otherwise noted.

      ... <img alt="Next >>" src="../next_button.jpg" />*Boom*

      And this is why you use a real parser, not just a regex . . .

      Update: Just to clarify, the above is a pathological case. If you're reasonably sure that it won't occur, go ahead and use the simple s///; but be aware that it's not bulletproof, and know where to find the right tool when the sledgehammer doesn't cut it any more.

        Since we're being pedantic about it, is '>' actually allowed inside attribute values in XML?
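
On that last question: XML 1.0 does allow a literal `>` inside an attribute value; only `<`, `&` and the enclosing quote character are excluded. The sketch below reproduces the failure on the img element quoted in the thread, wrapped in a made-up `<p>` element. The second substitution is a hypothetical refinement that lets a tag contain quoted strings; it is still not a parser and is shown only to illustrate why the simple pattern breaks.

```perl
use strict;
use warnings;

# The img element from the thread; its alt attribute contains literal '>' characters.
# The surrounding <p> element and text are made up for illustration.
my $xml = '<p>before <img alt="Next >>" src="../next_button.jpg" /> after</p>';

# The simple substitution treats the first '>' inside alt="..." as the end of the tag.
(my $naive = $xml) =~ s/<[^>]*>//g;

# Hypothetical refinement: a tag may contain quoted strings, which may hold '>'.
# Comments, CDATA sections and entity references are still not handled.
(my $quoted = $xml) =~ s/<(?:[^>"']|"[^"]*"|'[^']*')*>//g;

print "naive:  '$naive'\n";
print "quoted: '$quoted'\n";
```

The naive result keeps the tail of the img tag as if it were text, which is the *Boom* described above; a real parser (the XSLT or SAX approaches earlier in the thread) avoids the problem entirely.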
