
XML and Latin1 issue

by mike240se (Initiate)
on Nov 22, 2007 at 07:47 UTC (#652321=perlquestion)
mike240se has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I am a newbie with Perl; actually not even a newbie, since I haven't really started to learn it. I know PHP and various other languages but have never touched Perl. I have a Perl script that downloads the weather as XML and turns it into speech using Festival. It recently stopped working, and I've tracked the problem to the provider adding LATIN1 to the encoding="" attribute in their XML. XML::Simple and XML::Parser now complain because the document is declared as LATIN1 and they don't know what that is. I figure just stripping out LATIN1 would fix it, since the data I use contains only 0-9, a-z, and regular characters like comma, period, and semicolon, with no special characters at all. This is the relevant code, XML, and error:
#!/usr/bin/perl
use POSIX;
use XML::Parser;
use XML::Simple;
use Data::Dumper;
use LWP::Simple;
#use Switch;

my $url = " +a.asp?zip$
my $file = get($url);
$file =~ s/encoding="[^"] "//;
my $xs1 = XML::Simple->new();

if(-e "/var/lib/asterisk/sounds/currentconditions.ulaw") {
    unlink("/var/lib/asterisk/sounds/currentconditions.ulaw");
}
if(-e "/var/lib/asterisk/sounds/planets.ulaw") {
    unlink("/var/lib/asterisk/sounds/planets.ulaw");
}
if(-e "/var/lib/asterisk/sounds/forecast.ulaw") {
    unlink("/var/lib/asterisk/sounds/forecast.ulaw");
}

my $doc = $xs1->XMLin($file, ForceContent => 1);

# this is the current conditions section
my $city = $doc->{CurrentConditions}->{City}->{content};
$city =~ s/^\s*(.*?)\s*$/$1/;
my $speech_text = "Current conditions for $city, ";
print "City: " , $city , "\n";

I believe those are the relevant parts; the rest is similar code. This is the relevant part of the XML:
<?xml version="1.0" encoding="LATIN1"?>
<adc_Database>
  <WatchWarnAreas zone="NJZ008" county="NJC027"/>
  <GmtDiff DayLightSavings="0" > -5 </GmtDiff>
  <CurrentConditions>
    <City> Lake Hopatcong </City>
    <State> NJ </State>

I believe they added the LATIN1 part recently, which is why it's not working :(

The error is:
Couldn't open encmap latin1.enc: No such file or directory at /usr/local/lib/perl/5.8.4/XML/ line 187

UPDATE: I added
$file =~ s{encoding="LATIN1"}{encoding=""};
which fixed the latin1.enc error, but now I get the following error:
XML declaration not well-formed at line 1, column 30, byte 30 at /usr/local/lib/perl/5.8.4/XML/ line 187

I am happy to report I got this working by changing my code to:
$file =~ s{encoding="LATIN1"}{encoding="utf-8"};
I guess it wanted utf-8 instead of a blank encoding. Thanks, all.

Replies are listed 'Best First'.
Re: XML and Latin1 issue
by bart (Canon) on Nov 22, 2007 at 12:02 UTC
    "LATIN1" is not a valid encoding. Try "ISO-Latin-1", "ISO-8859-1", or "Windows-1252". One of those is bound to work. Your "UTF-8" will work fine... until you get accented characters in the data.

    And I hate it, how XML implementors are making shit up as they go along.
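
A minimal sketch of the relabeling suggested in this reply, assuming the feed's bytes really are Latin-1 and only the label is nonstandard. The helper name fix_encoding_label is hypothetical, and the sample document is made up for illustration. ISO-8859-1 is one of the encodings expat (and therefore XML::Parser) handles natively, so no extra .enc map file should be needed for it.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the nonstandard "LATIN1" label in the XML
# declaration for the registered name ISO-8859-1. Only the declaration
# at the very start of the document is touched.
sub fix_encoding_label {
    my ($xml) = @_;
    $xml =~ s/\A(<\?xml[^>]*?encoding=)(["'])LATIN1\2/${1}${2}ISO-8859-1${2}/i;
    return $xml;
}

# Made-up sample input, shaped like the declaration in the question.
my $sample = qq{<?xml version="1.0" encoding="LATIN1"?>\n<adc_Database/>\n};
print fix_encoding_label($sample);
```

Unlike relabeling the document as utf-8, this keeps the declaration truthful, so accented characters should still parse if they ever show up in the data.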

Re: XML and Latin1 issue
by ikegami (Pope) on Nov 22, 2007 at 21:08 UTC


    if(-e "/var/lib/asterisk/sounds/currentconditions.ulaw") {
        unlink("/var/lib/asterisk/sounds/currentconditions.ulaw");
    }
    if(-e "/var/lib/asterisk/sounds/planets.ulaw") {
        unlink("/var/lib/asterisk/sounds/planets.ulaw");
    }
    if(-e "/var/lib/asterisk/sounds/forecast.ulaw") {
        unlink("/var/lib/asterisk/sounds/forecast.ulaw");
    }

    can be simplified to

    unlink(
        "/var/lib/asterisk/sounds/currentconditions.ulaw",
        "/var/lib/asterisk/sounds/planets.ulaw",
        "/var/lib/asterisk/sounds/forecast.ulaw",
    );

    unlink tries to delete all the files, even if it can't delete some of them.

    >copy nul a
            1 file(s) copied.

    >copy nul c
            1 file(s) copied.

    >dir /b
    a
    c

    >perl -e"unlink qw( a b c )"

    >dir /b

    >
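
The same behaviour can be checked from inside a script, since unlink returns the number of files it actually deleted. A small sketch, using a temporary directory and made-up file names (a, b, c) rather than the Asterisk sound files:

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Work in a throwaway directory that is removed when the script exits.
my $dir   = tempdir(CLEANUP => 1);
my @paths = map { File::Spec->catfile($dir, $_) } qw(a b c);

# Create only "a" and "c"; "b" deliberately does not exist.
for my $path (@paths[0, 2]) {
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

# unlink attempts every path and returns how many were removed.
my $deleted = unlink @paths;
print "Deleted $deleted of ", scalar(@paths), " files\n";   # Deleted 2 of 3 files
```

Comparing the return value with the number of paths is a cheap way to notice that something was not removed, without testing each file with -e first.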
Re: XML and Latin1 issue
by olus (Curate) on Nov 22, 2007 at 13:50 UTC
    If you are only using plain characters, ASCII would be enough.
    But the producers of the XML file are telling you that this might not always be so, so you should expect to see some non-ASCII characters in the future.
    As bart correctly pointed out, telling XML::Simple that the bytes you are passing are UTF-8 solves your problem only until the XML actually contains a special character that is encoded in Latin-1 rather than UTF-8.
    So, in addition to changing the declared encoding, you must also convert the stream to UTF-8.
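
One way to do that conversion is with the core Encode module. This is only a sketch, assuming the downloaded bytes really are Latin-1; the helper name latin1_to_utf8 and the one-character sample are made up for illustration.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: reinterpret Latin-1 bytes as characters, then
# serialize those characters as UTF-8 bytes.
sub latin1_to_utf8 {
    my ($bytes) = @_;
    my $chars = Encode::decode('ISO-8859-1', $bytes);
    return Encode::encode('UTF-8', $chars);
}

# "\xE9" is e-acute in Latin-1; in UTF-8 it becomes the two bytes C3 A9.
my $utf8 = latin1_to_utf8("\xE9");
print join(' ', map { sprintf '%02X', ord($_) } split //, $utf8), "\n";   # C3 A9
```

After converting, a declaration of encoding="utf-8" would actually match the bytes handed to the parser.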
    Unfortunately, examples like the one you brought here are abundant on the web: the encoding information is inaccurate, and many people get headaches dealing with such misleading declarations.
    Some techniques try to avoid such problems, such as parsing the data assuming UTF-8 and, if that fails, trying a conversion from another encoding, and so on...
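
A rough sketch of that fallback idea, again with Encode: try a strict UTF-8 decode first and fall back to Latin-1 if it fails. The helper name decode_guess is hypothetical, and this is a heuristic, not a guarantee, because some Latin-1 byte sequences also happen to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a character string from bytes of unknown
# encoding. A strict UTF-8 decode is attempted first; if the bytes are
# not valid UTF-8, they are treated as ISO-8859-1 instead.
sub decode_guess {
    my ($bytes) = @_;
    my $text = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $text if defined $text;
    return Encode::decode('ISO-8859-1', $bytes);
}

# The same word, once as UTF-8 bytes and once as Latin-1 bytes.
for my $bytes ("caf\xC3\xA9", "caf\xE9") {
    my $text = decode_guess($bytes);
    printf "%d characters, last is U+%04X\n",
        length($text), ord(substr($text, -1));
}
```

Both inputs should come out as the same four-character string, which can then be encoded as UTF-8 before it is passed to the parser.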
    I'd recommend the following reading:
    Perl Unicode
    The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
