in reply to Re: XML::Twig and UTF-8
in thread XML::Twig and UTF-8

Thanks for the help. Preprocessing the text with Text::Unidecode certainly does the trick.

I must be misunderstanding something, though. If I don't preprocess, I get this error: "not well-formed (invalid token) at line 2, column 25, byte 68 at C:/Perl_588/lib/XML/Parser.pm line 187". I thought it was due to the Unicode character in the input, but your comment says that shouldn't be a problem. What am I doing wrong?

<Text>5CH (the BACKSLASH \ in ISO-IR 6) shall</Text>
01234567890123456789012345
          1         2    ^

Updated with the results of Text::Unidecode and repeated the input text from DATA.

Re^3: XML::Twig and UTF-8
by mirod (Canon) on Sep 19, 2008 at 07:13 UTC

    There are plenty of places where things could go wrong. I don't know what your environment is, so I don't know exactly what you are parsing: which encoding it is in, what your locale setting is (UTF-8 or not?), or which version of perl you are using. I don't think a DATA section in utf8 would work before 5.8.1, so try parsing the data from a file.

    For the record I had no problem running the code in your original question, so there is nothing wrong with it.

Re^3: XML::Twig and UTF-8
by AZed (Monk) on Sep 19, 2008 at 05:31 UTC

    To get an answer to that, you'd have to tell us what is at line 2, column 25, byte 68 of your input. (And likely some of the surrounding text as well.)
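
    One way to see what sits at a reported byte offset is to dump the raw bytes around it. The sketch below is only an illustration: bytes_around is a hypothetical helper, the sample string stands in for the real input, and the E9 byte is an assumed Latin-1 character, not the one from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: hex-dump the bytes surrounding a byte offset,
# such as the one reported in an XML::Parser error message.
sub bytes_around {
    my ($raw, $offset, $context) = @_;
    my $start = $offset - $context;
    $start = 0 if $start < 0;
    my $chunk = substr($raw, $start, 2 * $context + 1);
    return join ' ', map { sprintf '%02X', ord } split //, $chunk;
}

# Assumed sample input: saved as Latin-1, so the non-ASCII character
# is the single byte E9 instead of the UTF-8 sequence C3 A9.
my $raw = "<Text>caf\xE9</Text>";

print bytes_around($raw, 9, 2), "\n";   # prints: 61 66 E9 3C 2F
```

    For a real file, the bytes would be read through a raw handle (open my $fh, '<:raw', $path) so that no I/O layer decodes them first. A lone byte above 7F between plain ASCII bytes, as in this dump, points to a single-byte encoding such as Latin-1 rather than UTF-8.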

    Sorry, that's what I get for trying to reply at that hour. I have replicated the problem now by downloading your source and running recode utf8..latin1 on it.

    The problem is with the XML parser before it ever actually reaches Twig -- the Twig encoding filters convert parsed information from one encoding to another, but they can't affect the parsing itself. The issue is that your XML declaration has told the parser to expect one encoding, but it has received another. In other words, if you had:

        __DATA__
        <?xml version = '1.0' encoding = 'iso-8859-1'?>
        <Text>5CH (the BACKSLASH \ in ISO-IR 6) shall</Text>

    ... and you were absolutely certain that your source file had latin-1 encoding, you wouldn't have to mess with input filters at all. This would be sufficient to deal with it:

        my $twig = XML::Twig->new();
        $twig->parse( $xml );

    If you later recoded that file to utf-8 (via something like recode latin1..utf8 filename), you might have problems with the charset again, though odds are that it would actually parse and give you garbage. THEN you might need to play with an input filter, not to get the parsing working, but to convert the garbage you got out of it to what you wanted.
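
    The two failure modes described above can be reproduced without an XML parser at all. This is a rough sketch using only the core Encode module; is_valid_utf8 is a hypothetical helper, and the e-acute is an arbitrary stand-in for whatever non-ASCII character is in the real input. It assumes the strict case corresponds to a parser that expects UTF-8, which is the XML default when no other encoding is declared.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# Hypothetical helper: true if the byte string is well-formed UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input. decode may modify
    # its argument in that mode, so only the local copy is passed in.
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $char   = "\x{E9}";                      # e-acute, an arbitrary example
my $latin1 = encode('iso-8859-1', $char);   # one byte:  E9
my $utf8   = encode('UTF-8',      $char);   # two bytes: C3 A9

# Latin-1 bytes handed to something expecting UTF-8: rejected outright.
printf "latin-1 bytes valid as UTF-8? %s\n", is_valid_utf8($latin1) ? 'yes' : 'no';   # no
printf "utf-8 bytes valid as UTF-8?   %s\n", is_valid_utf8($utf8)   ? 'yes' : 'no';   # yes

# UTF-8 bytes read as Latin-1: never an error, just the wrong characters.
my $garbage = decode('iso-8859-1', $utf8);
printf "utf-8 read as latin-1 gives %d characters\n", length $garbage;                # 2
```

    The asymmetry is the point: every byte is a legal Latin-1 character, so a Latin-1 reading of UTF-8 data succeeds silently, while a UTF-8 reading of Latin-1 data fails as soon as it hits a high byte.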

      Isn't that given in __DATA__, which the listed code reads?