http://www.perlmonks.org?node_id=582534

devnul has asked for the wisdom of the Perl Monks concerning the following question:

Hello, I am trying to work with an XML document and am having some encoding issues with it. First, here are the first few bytes of the XML document passed through "od -c":
0000000 377 376   <  \0   ?  \0   x  \0   m  \0   l  \0      \0   v  \0
0000020   e  \0   r  \0   s  \0   i  \0   o  \0   n  \0   =  \0   "  \0
0000040   1  \0   .  \0   0  \0   "  \0      \0   e  \0   n  \0   c  \0
0000060   o  \0   d  \0   i  \0   n  \0   g  \0   =  \0   "  \0   u  \0
0000100   t  \0   f  \0   -  \0   1  \0   6  \0   "  \0   ?  \0   >  \0
0000120  \r  \0  \n  \0   <  \0   l  \0   i  \0   s  \0   t  \0   i  \0
0000140   n  \0   g  \0   _  \0   f  \0   e  \0   e  \0   d  \0   >  \0
0000160  \r  \0  \n
Of relevance here, I suppose, is that it declares encoding="utf-16", which suggests to me that the XML document is correctly formed, although I am certainly no expert in this area. In plain ASCII, this would look something like:
<?xml version="1.0" encoding="utf-16"?>
<listing_feed>
Now, I'm trying to do a regular expression match on this string, like so:
if($string =~ m/xml/) { ... }
.. but this does not work... However, this does:
if($string =~ m/x.m.l./) { ... }
Have I completely misunderstood that this type of regular expression match should work? I've tried using the "Encode" module to re-code this to UTF-8, or even ASCII but nothing I do can make this work.

.. I think I must be missing something obvious here...

- dEvNuL

Replies are listed 'Best First'.
Re: Encoding question
by ikegami (Patriarch) on Nov 07, 2006 at 01:08 UTC
    Perl has two types of strings: strings of bytes and strings of characters. You wish to match characters, but you're matching bytes. Decode the bytes into characters (using Encode's decode function, for example).
    use Encode qw( decode );

    my $bytes = "x\0m\0l\0";
    print(length($bytes), "\n");        # 6

    my $chars = decode('utf16le', $bytes);
    print(length($chars), "\n");        # 3
    print($chars =~ /xml/ ?1:0,"\n");   # 1

    Update: Oops, I had utf8 instead of utf16le originally. utf16 will also work with the original string since it contained a BOM.
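
To illustrate the update above, here is a minimal sketch of how the two encoding names treat a leading BOM. The byte string is a shortened stand-in for the start of the file in the question, and the variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Stand-in for the start of the file: BOM (FF FE), then '<?xml' in UTF-16LE.
my $bytes = "\xFF\xFE<\0?\0x\0m\0l\0";

# 'UTF-16' uses the BOM to pick the byte order and drops it from the result.
my $bom_consumed = decode('UTF-16', $bytes);

# 'UTF-16LE' takes the byte order as given and keeps the BOM as U+FEFF.
my $bom_kept = decode('UTF-16LE', $bytes);

print length($bom_consumed), "\n";                       # 5
print length($bom_kept), "\n";                           # 6
print $bom_consumed =~ /^<\?xml/ ? 1 : 0, "\n";          # 1
print $bom_kept =~ /^\x{FEFF}<\?xml/ ? 1 : 0, "\n";      # 1
```

Either name makes /xml/ match; the difference only matters if a pattern is anchored at the start of the string.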

Re: Encoding question
by Errto (Vicar) on Nov 07, 2006 at 01:22 UTC
    Since it seems like this is coming from a file, the fact that it shows up that way in Perl suggests that you're not reading the file in its actual encoding. To do that, simply open the file like so:
    open my $fh, '<:encoding(utf16le)', $filename;
    Also, since your file seems to have a BOM at the beginning, you may need to do a little
    s/^\x{feff}//
    on the first line of the file.
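
Put together, the open call and the BOM substitution might look like the sketch below. It writes its own small UTF-16LE file through File::Temp so that it runs stand-alone; the temporary file and the variable names are assumptions for illustration, not part of the original question.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Create a small UTF-16LE file with a BOM so the sketch is self-contained.
my $declaration = qq{<?xml version="1.0" encoding="utf-16"?>\r\n};
my ($tmp, $filename) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xFF\xFE", map { "$_\0" } split //, $declaration;
close $tmp or die "Cannot close $filename: $!";

# Read through the encoding layer so that <$fh> returns characters, not bytes.
open my $fh, '<:encoding(UTF-16LE)', $filename
    or die "Cannot open $filename: $!";
my $first_line = <$fh>;
close $fh;

# With an explicit byte order the BOM arrives as the character U+FEFF.
$first_line =~ s/^\x{FEFF}//;

print $first_line =~ /xml/ ? "matched\n" : "no match\n";   # matched
```
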
Re: Encoding question
by graff (Chancellor) on Nov 07, 2006 at 02:42 UTC
    To supplement the good advice given above...

    I've tried using the "Encode" module to re-code this to UTF-8, or even ASCII but nothing I do can make this work.

    The previous replies have given enough info to read the file correctly. If you want to write the data to a new file using utf8 instead of utf-16le, you'd want to make sure to set the output file handle to utf8 mode, and replace the "utf-16" in the opening xml tag with "utf-8":

    open(IN,"<:encoding(utf-16le)","your_file.xml");
    { local $/; $_ = <IN>; }
    s/(xml version="1.0" encoding="utf)-16/$1-8/; # (that was cheating, but what the heck)

    # do whatever else needs to be done with the data
    # -- but use a real xml parser for that... then:

    binmode STDOUT, ":utf8"; # or whatever file handle you need to use
    print;
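
A self-contained variant of the same conversion, as a sketch only: it uses lexical file handles, checks for errors, and writes to a temporary file instead of STDOUT so the result can be inspected. The input is generated on the fly with File::Temp, dropping the BOM is an extra step not in the reply above, and all file and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Build a UTF-16LE input file (with BOM) so the sketch runs on its own.
my $xml = qq{<?xml version="1.0" encoding="utf-16"?>\r\n<listing_feed>\r\n};
my ($tmp_in, $in_name) = tempfile(UNLINK => 1);
binmode $tmp_in;
print {$tmp_in} "\xFF\xFE", map { "$_\0" } split //, $xml;
close $tmp_in or die "Cannot close $in_name: $!";

# Slurp the whole file as characters.
open my $in, '<:encoding(UTF-16LE)', $in_name
    or die "Cannot open $in_name: $!";
my $data = do { local $/; <$in> };
close $in;

# Drop the BOM and patch the declared encoding (still "cheating";
# a real XML parser is the better tool for anything beyond this).
$data =~ s/^\x{FEFF}//;
$data =~ s/(<\?xml version="1\.0" encoding="utf)-16/$1-8/;

# Write the characters back out as UTF-8.
my ($out, $out_name) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} $data;
close $out or die "Cannot close $out_name: $!";

# Read the result back as raw bytes to see what was written.
open my $check, '<:raw', $out_name or die "Cannot open $out_name: $!";
my $raw = do { local $/; <$check> };
close $check;
print $raw;
```

The output file contains one byte per character here because the sample text is pure ASCII; non-ASCII characters would come out as multi-byte UTF-8 sequences.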