PerlMonks  

UTF-8 problem parsing XML

by rinceWind (Monsignor)
on Apr 14, 2007 at 14:17 UTC (#610066)
rinceWind has asked for the wisdom of the Perl Monks concerning the following question:

I have the following fairly simple script to parse BookMooch data:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use CGI qw(:standard);
    use CGI::Carp;
    use WWW::Mechanize;
    use XML::Simple;
    use YAML;
    use Encode;

    my $mech     = WWW::Mechanize->new;
    my $api_base = 'http://api.bookmooch.com/api/userid';

    my $q    = CGI->new;
    my $user = $q->param('user');

    $mech->get("$api_base?userids=$user");
    die "Failed to get user $user from BookMooch" unless $mech->success;

    print header, start_html, "\n";

    my $xml = $mech->content;
    # $xml = encode('iso-8859-1', $xml);   # (doesn't fix the problem)
    my $data = XMLin($xml);

    print pre(Dump($data)), end_html;

When I run it, I get the following output:

    Content-Type: text/html; charset=ISO-8859-1

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
    <head>
    <title>Untitled Document</title>
    <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
    </head>
    <body>
    :1492: parser error : Input is not proper UTF-8, indicate encoding !
    Bytes: 0xA3 0x31 0x20 0x31
    <condition>Good condition. 1970 edition with net cover price shown as 1 15s
    ^
     at /usr/lib/perl5/XML/LibXML/SAX/Parser.pm line 31

Uncommenting the line with encode doesn't make any difference. I'm new to UTF-8 and encoding. What's the correct incantation for what I'm doing?

The input data is Latin-1 as far as I'm aware, and it's b0rking on a pound sign '£' (byte 0xA3 in the error above).

Any help would be much appreciated.

--
Apprentice wetware hacker

Re: UTF-8 problem parsing XML
by Joost (Canon) on Apr 14, 2007 at 15:54 UTC
    Assuming the input file is really latin-1, you can do a couple of things:

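    • Add an XML declaration with an encoding attribute at the very start of the XML file - <?xml version="1.0" encoding="iso-8859-1"?> - that should make sure the XML parser will do the right thing. (The version attribute is required whenever the declaration is present.)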
    • Convert the XML to utf8, which you've tried but you got it backwards: encode turns Perl characters into bytes, while decode turns bytes into Perl characters. The correct statement is $xml = decode('iso-8859-1', $xml);
    • Figure out some other way to signal the real encoding to XML::Simple. Not sure if you can.
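
To illustrate the first two options, here is a minimal sketch that uses only the core Encode module. The sample string is hypothetical and merely stands in for the BookMooch response; it assumes, as the question does, that the data really is Latin-1. It shows why encode leaves the bytes untouched, what decode produces instead, and how the declaration from the first option could be prepended.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical sample: Latin-1 bytes containing a pound sign (0xA3),
# standing in for the XML returned by the remote API.
my $bytes = "<condition>cover price \xA31 15s</condition>";

# encode() on a byte string changes nothing here: Perl treats each byte as a
# Latin-1 character and writes the very same bytes back out.
my $same = encode('iso-8859-1', $bytes);

# decode() reads the bytes as Latin-1 and returns a character string
# in which the pound sign is the single character U+00A3.
my $chars = decode('iso-8859-1', $bytes);

# Encoding those characters as UTF-8 turns 0xA3 into the pair 0xC2 0xA3.
my $utf8 = encode('UTF-8', $chars);

# Alternative: leave the bytes alone and declare their encoding,
# unless the document already starts with an XML declaration.
my $declared = $bytes =~ /\A<\?xml/
    ? $bytes
    : qq{<?xml version="1.0" encoding="iso-8859-1"?>\n} . $bytes;

print "encode left the bytes unchanged\n" if $same eq $bytes;
printf "byte length: Latin-1 %d, UTF-8 %d\n", length $bytes, length $utf8;
print $declared, "\n";
```

Either $chars or $declared would then be handed to XMLin in place of the raw content. Which of the two a given parser handles better is not something this sketch verifies.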

      Thanks! Changing encode to decode did the trick.

      --
      Apprentice wetware hacker
