http://www.perlmonks.org?node_id=427443

dbrock has asked for the wisdom of the Perl Monks concerning the following question:

Hello...
I am trying to convert an XML file encoded in UTF-16... I want to strip the XML file down to a flat ASCII text file... I am trying to do this by using regexes...
    $file = "c:\\temp\\1.xml";
    $out = "c:\\temp\\output.txt";
    open(FH, $file) or die "Cannot Open $file :$!";
    open(OUT, ">$out") or die "Cannot Open $out :$!";
    while(<FH>) {
        s/^.*(<.*>)//g;
        s/(?<=\w) (?=\w)//g;
        s/\n\n/\n/g;
        s/ / /g;
        print OUT $_;
    }
    close FH;
    close OUT;

I have successfully stripped out all of the XML tags... However, the double spacing and double line returns are still in the output...
--------Example of the output TEXT---------
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
J o b s e r v e r : S E R V E R N A M E
J o b n a m e : S E R V E R N A M E - I n c
J o b s t a r t e d : M o n d a y , D e c e m b e r 2 7 , 2 0 0 4 a t 2 : 5 3 : 3 8 P M
J o b t y p e : B a c k u p
J o b L o g : B E X 0 0 1 6 4 . x m l
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
D r i v e a n d m e d i a i n f o r m a t i o n f r o m m e d i a m o u n t :
R o b o t i c L i b r a r y N a m e : C O M P A Q 1
D r i v e N a m e : C O M P A Q 1
S l o t : 1
M e d i a L a b e l : D S W 0 0 0
M e d i a G U I D : { 4 3 1 B 0 3 D E - 1 C 4 9 - 1 1 D 4 - B 2 1 C - 0 0 5 0 8 B C A 3 A 6 8 }
O v e r w r i t e P r o t e c t e d U n t i l : 1 / 3 0 / 2 0 0 5 3 : 1 4 : 4 1 A M
A p p e n d a b l e U n t i l : 1 2 / 3 1 / 9 9 9 9 1 2 : 0 0 : 0 0 A M
T a r g e t e d M e d i a S e t N a m e : D a i l y
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
J o b O p e r a t i o n - B a c k u p
M e d i a o p e r a t i o n - a p p e n d .
H a r d w a r e c o m p r e s s i o n e n a b l e d .
= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =
----------------------End output Example----------

If anyone could tell me what I'm failing to do correctly, I would be able to continue my script upgrade...
Thank you
DBrock...

Replies are listed 'Best First'.
Re: Decoding UTF-16 to ASCII
by fauria (Deacon) on Feb 02, 2005 at 22:13 UTC
    Using a CPAN module for parsing XML is a good idea.

    XML::Simple is really simple to use. Just:
        use XML::Simple;
        use Data::Dumper;
        my $xml_hashref = XMLin($file);
        print Dumper $xml_hashref;    # Just to see how it is structured

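    Then you can iterate over the hashref and clean up the whitespace in each value. For example, this removes every whitespace character from one value (turning one space into no space and two spaces into one would need a different pattern):

        $xml_hashref->{tag1}->{nested_tag}{value} =~ s/\s//gs;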


    This node describes how to iterate hashrefs.
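
    As a rough sketch of the iteration step, the helper below walks a nested structure of hash and array references and rewrites every plain value in place. The name clean_values, the sample keys and the whitespace rule are all made up for illustration; real XMLin() output depends on the XML and on the options passed to it.

```perl
use strict;
use warnings;

# Recursively visit nested hash and array references and replace every
# plain (non-reference) value with the result of calling $fix on it.
sub clean_values {
    my ($node, $fix) = @_;
    if (ref $node eq 'HASH') {
        for my $key (keys %$node) {
            if (ref $node->{$key}) {
                clean_values($node->{$key}, $fix);
            }
            elsif (defined $node->{$key}) {
                $node->{$key} = $fix->($node->{$key});
            }
        }
    }
    elsif (ref $node eq 'ARRAY') {
        for my $item (@$node) {    # $item aliases the array element
            if (ref $item) {
                clean_values($item, $fix);
            }
            elsif (defined $item) {
                $item = $fix->($item);
            }
        }
    }
    return $node;
}

# Hypothetical data shaped like XMLin() output; the key names are invented.
my $xml_hashref = {
    tag1 => { nested_tag => { value => "Job  type:   Backup " } },
    list => [ "  a   b ", "c" ],
};

# Collapse runs of whitespace to one space, then trim both ends.
clean_values($xml_hashref, sub {
    my ($text) = @_;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
});

print $xml_hashref->{tag1}{nested_tag}{value}, "\n";
```

    Passing the cleanup rule as a code reference keeps the traversal separate from the question of what "clean" should mean for a given file.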
Re: Decoding UTF-16 to ASCII
by Anonymous Monk on Feb 03, 2005 at 10:25 UTC
    simple translator:
        use open IN => ':encoding(utf-16)', OUT => ':encoding(us-ascii)';
        print while <>;
    Of course, this breaks when there are characters which are not in the us-ascii character set. Perhaps you want to translate into utf-8 instead?
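
    If you would rather name the encodings on each filehandle than rely on the pragma, something along these lines should do the same job while writing UTF-8. This is only a sketch: the sub name utf16_to_utf8 is made up, and it assumes the input starts with a byte order mark, which the plain "UTF-16" decoder needs in order to pick the byte order.

```perl
use strict;
use warnings;

# Copy $in to $out, decoding UTF-16 (byte order taken from the BOM)
# and writing UTF-8. Both file names are supplied by the caller.
sub utf16_to_utf8 {
    my ($in, $out) = @_;
    # :raw first, so no CRLF translation is applied to the undecoded bytes;
    # :crlf on top turns decoded "\r\n" line endings into "\n".
    open my $src, '<:raw:encoding(UTF-16):crlf', $in
        or die "Cannot open $in: $!";
    open my $dst, '>:raw:encoding(UTF-8)', $out
        or die "Cannot open $out: $!";
    print {$dst} $_ while <$src>;
    close $src;
    close $dst or die "Cannot close $out: $!";
    return;
}

# Usage: perl convert.pl input.xml output.txt   (script name is arbitrary)
utf16_to_utf8(@ARGV) if @ARGV == 2;
```

    Swapping the output layer for ':encoding(us-ascii)' gives the ASCII-only variant, with the same caveat about characters outside that set.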
Re: Decoding UTF-16 to ASCII
by graff (Chancellor) on Feb 07, 2005 at 01:58 UTC
    The previous two replies, taken together, provide the right answers for doing both XML tag removal and character conversion out of UTF-16. But I think it's important to draw attention to a couple more details, by way of explanation.

    To say that the code you posted is "successful" at stripping out the XML tags is to stretch the definition of "success" a bit, to include results like "yeah, some of the data is missing too, but all the XML tags are gone, so that's success!"

    In this line of your code:

        s/^.*(<.*>)//g;
    the final "g" is superfluous -- it never has any effect -- because the pattern is anchored at the start of the line and both ".*" parts are greedy, so a single match removes everything from the beginning of the line through the last ">" in a given string. So a line of data like this:
        <tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>
    will need just one application of your regex to end up as an empty string (regardless of whether it's UTF-16 or whatever). Maybe you think you know your particular XML data well enough that your chosen heuristic will work okay. But someday you'll get some XML data that will break it. That's why you should use an XML module to handle XML data; code based on a real parser will work on any well-formed XML data.
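
    The effect is easy to check on the sample line. The snippet below applies the original substitution and, for comparison, a negated-character-class pattern; the second one is only there to show the difference and is still a heuristic (it knows nothing about comments, CDATA sections or a ">" inside an attribute value), not a substitute for a parser.

```perl
use strict;
use warnings;

my $line = '<tag1> data <tag2> more data </tag2><tag2> even more data </tag2></tag1>';

# The substitution from the question: one anchored, greedy match
# swallows everything up to the last ">".
(my $greedy = $line) =~ s/^.*(<.*>)//g;

# For comparison: remove each "<...>" run separately, keeping the text.
(my $per_tag = $line) =~ s/<[^>]*>//g;

printf "greedy:  [%s]\n", $greedy;
printf "per tag: [%s]\n", $per_tag;
```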

    There might be better solutions than the XML::Simple::XMLin method suggested above; for example, you could use XML::Parser like this if you just want to strip off the tagging:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::Parser;

        die "Usage: $0 file.xml\n" unless ( @ARGV == 1 and -f $ARGV[0] );

        my $parser = XML::Parser->new(
            Handlers         => { Char => \&print_chars },
            ProtocolEncoding => 'UTF-16',
        );
        $parser->parsefile( $ARGV[0] );

        # Char handler: the last argument is the character data
        sub print_chars { print pop; }
    That's all there is to it. Notice the part that says what sort of input character encoding to use. As the data file gets read in, it is converted internally from utf-16 to utf8, and will be printed as utf8 -- and if there happen to be no "wide" (non-ascii) characters in your data files, conversion to utf8 really means conversion to ascii, because ascii is a proper subset of the utf8 character set.

    As for the difference between utf-16 and utf8, it's really a very simple matter for data that consists only of characters in the ascii range: for every single-byte ascii character, attach a null high byte (e.g. \x40 becomes \x{0040}) and voila! the result is the corresponding utf-16 character code.
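
    You can see this with the Encode module. The sketch below assumes little-endian UTF-16 without a byte order mark (UTF-16LE, the usual layout for files written on Windows), where the null byte follows each ASCII byte on disk; the sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Job";

# Encode the same three characters both ways.
my $utf16 = encode('UTF-16LE', $text);
my $utf8  = encode('UTF-8',    $text);

# Render the raw bytes as space-separated hex pairs.
my $hex16 = join ' ', unpack '(H2)*', $utf16;
my $hex8  = join ' ', unpack '(H2)*', $utf8;

print "UTF-16LE: $hex16\n";    # 4a 00 6f 00 62 00
print "UTF-8:    $hex8\n";     # 4a 6f 62
```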

    The reason you were failing to get rid of the unwanted high-bytes, I think, is that you mistakenly thought these were spaces instead of null bytes. (I think the Windows "MS-DOS Prompt" window app normally replaces non-displayable character codes -- or at least null bytes -- with spaces.)
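
    Here is a small way to convince yourself of that. It fakes one line of undecoded input by encoding a sample string as UTF-16LE (an assumption about the file's byte order), applies the space-removing substitution from the question, and then decodes the same bytes properly.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate the raw bytes of one line read without any decoding layer.
my $raw = encode('UTF-16LE', "Job type: Backup");

# The substitution from the question finds nothing to remove: every real
# space has a null byte next to it, not a word character.
(my $attempt = $raw) =~ s/(?<=\w) (?=\w)//g;
my $nulls_left = ($attempt =~ tr/\0//);

# Decoding the bytes as UTF-16LE recovers the text.
my $decoded = decode('UTF-16LE', $raw);

print "null bytes left after the regex: $nulls_left\n";
print "decoded: $decoded\n";
```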

    Anyway, it's better to use Perl's encoding tools for doing character conversions, just like it's better to use XML modules for parsing XML.