PerlMonks  

XML::Parser chokes on UTF-8?

by perlcgi (Hermit)
on Dec 13, 2002 at 18:39 UTC

perlcgi has asked for the wisdom of the Perl Monks concerning the following question:

Why does XML::Parser complain about invalid data in a line like this?
<title>Os cem melhores contos brasileiros do século /Italo Moriconi, + organização, introdução e referÃ</title>

What's the best way to fix it?
Thanks
perlcgi

Replies are listed 'Best First'.
•Re: XML::Parser chokes on UTF-8?
by merlyn (Sage) on Dec 13, 2002 at 18:41 UTC
    Do you have the right charset in the XML header?

    And I can't tell, since I don't know the language. Is that UTF-8, or latin-1? If it's latin-1, you definitely need a charset header. I think the default is UTF-8, though.

    -- Randal L. Schwartz, Perl hacker
    Be sure to read my standard disclaimer if this is a reply.

      My apologies, I should have checked - Friday evening frazzle - changing the charset header to iso-8859-1 (Latin-1) cures it.
      Thanks Randal.
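
A minimal sketch of why the encoding declaration matters, assuming XML::Parser is installed. The sample title is hypothetical Latin-1 data (a single 0xE9 byte for the accented letter), not the original poster's file. Without a declaration the parser assumes UTF-8 and rejects the byte; with `encoding="ISO-8859-1"` the same bytes parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical sample: "seculo" with an e-acute stored as one Latin-1 byte (0xE9).
my $title = "<title>s\xE9culo</title>";

# No declaration: the parser assumes UTF-8, and 0xE9 followed by 'c'
# is not a valid UTF-8 sequence, so parse() dies.
my $without = eval { XML::Parser->new->parse($title); 1 };

# Same bytes, but declared as Latin-1: parse() succeeds.
my $declared = qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n$title};
my $with = eval { XML::Parser->new->parse($declared); 1 };

print "without declaration: ", ($without ? "parsed" : "failed"), "\n";
print "with declaration:    ", ($with    ? "parsed" : "failed"), "\n";
```

Note that declaring Latin-1 makes almost any byte sequence acceptable to the parser, so it can also hide data that was really UTF-8 but got damaged along the way.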
Re: XML::Parser chokes on UTF-8?
by mirod (Canon) on Dec 13, 2002 at 19:20 UTC

    The last 'Ã' looks very suspicious. Latin-1 characters outside the basic 0-127 range are stored on 2 bytes in UTF-8, and they look like 'Ã?' when displayed as Latin-1, so a lone 'Ã' is certainly an error. My guess would be that a cut-and-paste went wrong and the last character of the string was lost.

    I used this bit of code to check the string BTW:

    #!/usr/bin/perl -w
    use strict;
    use XML::Parser;
    use Text::Iconv;

    # read the sample line from the DATA section below
    my $text= <DATA>;

    # raise an exception on conversion errors
    Text::Iconv->raise_error(1);

    # convert from UTF-8 to Latin-1
    my $converter= Text::Iconv->new( utf8 => 'latin1');
    my $in_latin1= $converter->convert( $text);
    print "text in latin1: $in_latin1\n";

    __DATA__
    <title>Os cem melhores contos brasileiros do século /Italo Moriconi, + organização, introdução e refer</title>
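
The two-byte point above can also be checked without Text::Iconv. This is a rough sketch using only the core Encode module (shipped with Perl 5.8 and later); the variable names and the "refer" sample string are illustrative, not taken from the original data. It encodes one Latin-1 character to UTF-8, then cuts off the second byte to reproduce a lone 'Ã' and shows that strict decoding rejects it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+00E9 (e with acute accent), a Latin-1 character above 127.
my $char  = "\x{E9}";
my $bytes = encode('UTF-8', $char);

# In UTF-8 it takes two bytes: lead byte 0xC3, continuation byte 0xA9.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "UTF-8 bytes: $hex\n";

# Keep only the lead byte, as if the end of the string had been cut off.
# Viewed as Latin-1, that lone 0xC3 is displayed as 'Ã'.
my $truncated = "refer" . substr($bytes, 0, 1);

# Strict decoding croaks on the incomplete sequence.
my $ok = eval {
    decode('UTF-8', $truncated, Encode::FB_CROAK | Encode::LEAVE_SRC);
    1;
};
print $ok ? "decoded fine\n" : "malformed UTF-8: lone lead byte\n";
```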
Re: XML::Parser chokes on UTF-8?
by dmitri (Priest) on Dec 13, 2002 at 18:57 UTC
    What do you mean by "complain"? Does it die? Are you using 5.6.1 or 5.8.0? In 5.6.1, the MIME::Base64 C code does not know how to handle wide characters, so if your strings are tagged as UTF-8, that's the problem.

    Better yet, post the warning your code produces.
