PerlMonks  

Encoding horridness

by jfrm (Monk)
on Jul 12, 2017 at 13:01 UTC (#1194918=perlquestion)
jfrm has asked for the wisdom of the Perl Monks concerning the following question:

I have been uploading an XML file to a service provider for a long time with the first line output as:
$cureq .= '<?xml version="1.0" encoding="latin1"?>'."\n";
$cureq .= # lots of other xml stuff
open (XML, ">$xmlfile") or return("Could not open $xmlfile");
print $cureq;
close XML or return("Could not close $xmlfile");
Now I have to change the encoding from latin1 to UTF-8 and having read around quite a lot now, I realise that I just don't get it. I have tried changing what I thought were the critical 2 lines viz:
$cureq .= '<?xml version="1.0" encoding="UTF-8"?>'."\n";
open (XML, '>:encoding(UTF-8)', $xmlfile) or return("Could not open $xmlfile");
This creates the file but my service provider now returns 'Invalid XML'. I just don't get it and what's more I cannot think of a way to debug it or investigate more deeply. Any clues for this poor padawan would be appreciated.

Replies are listed 'Best First'.
Re: Encoding horridness
by choroba (Chancellor) on Jul 12, 2017 at 13:07 UTC
    Unfortunately, you haven't provided enough information. How do you get the non-ASCII characters into the XML?

    The following creates a well-formed XML:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;
     
    binmode STDOUT, ':encoding(UTF-8)';
    print "<?xml version='1.0' encoding='utf-8'?><áéíóů˙/>";
    

    Note the utf8 which interprets the characters in the right way. If you're reading the characters from a file, you need to specify the :encoding(UTF-8) layer for it, as well. Etc.

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      A careless reader might see "utf8 interprets the characters in the right way" and get the idea that it's going to fix all their utf8 woes. To be clear, use utf8 only changes how perl reads your program source code -- probably just your string literals.
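To make that point concrete, here is a minimal sketch (not from the thread) that compares the same string literal with and without use utf8. It assumes the script file itself is saved as UTF-8; the variable names are purely illustrative.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: this source file is saved as UTF-8.
# "use utf8" is lexically scoped, so each do-block sees its own setting.
my $as_chars = do { use utf8; "é" };   # literal decoded into one character
my $as_bytes = do { no utf8;  "é" };   # same literal kept as its raw UTF-8 bytes

printf "with use utf8:    length %d\n", length $as_chars;
printf "without use utf8: length %d\n", length $as_bytes;
```

Nothing here touches file handles, databases, or any other input: the pragma only decides how the literal in the source is read.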

        The "the" in "the characters" means I referenced the characters in the element name. utf8 changes how Perl reads your program source, but it does more than string literals:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use utf8;

        my $á = 123;
        say $á;
        ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: Encoding horridness
by Corion (Pope) on Jul 12, 2017 at 13:10 UTC

    You will also have to make sure that the data you are writing to the XML file has been read properly from your data source and has been properly decoded when reading it.

    Ideally you use Encode and decode all data when you read it into your program and encode it when writing it to your output. You have already taken care of encoding the output, but the input might not be valid UTF-8 or not be recognized by Perl as such.

    Assuming that your input data is a file with bytes encoded in Latin-1, you could read/decode the data as

    while( <$fh>) { my $payload = decode('Latin-1', $_); };
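A slightly fuller sketch of the same decode-on-input, encode-on-output idea follows. It is only an illustration: the in-memory file handle and the sample bytes stand in for a real Latin-1 input file, and the explicit encode call is used instead of an output layer so the resulting bytes can be inspected.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the Latin-1 bytes for "café\n", standing in for a real file.
my $latin1_bytes = "caf\xE9\n";

# Read the raw bytes through an in-memory file handle.
open my $fh, '<:raw', \$latin1_bytes
    or die "Cannot open in-memory input: $!";

my $utf8_bytes = '';
while (my $line = <$fh>) {
    my $payload = decode('ISO-8859-1', $line);   # bytes      -> characters
    $utf8_bytes .= encode('UTF-8', $payload);    # characters -> UTF-8 bytes
}
close $fh;

# Show the output bytes in hex so the re-encoding is visible.
print join(' ', map { sprintf '%02X', ord } split //, $utf8_bytes), "\n";
```

In a real script the encode step would normally be left to the ':encoding(UTF-8)' layer on the output handle, as in the original question.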

    For database values, you have the additional fun of finding out as what kind of data/encoding your database actually stores the values.

      Good advice to be sure. But since latin-1 is a subset of unicode, isn't decode('Latin-1', $_) pretty much a no-op?

        No, because high-bit characters/octets in Latin-1 encode differently as octets in UTF-8, and Perl doesn't know what to do with high-bit characters when writing them.

        The OP wants to move from Latin-1 to UTF-8. Latin-1 is not a subset of UTF-8.
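The difference is easy to see at the byte level. This small sketch (an illustration, not code from the thread) encodes the single character U+00E9 both ways and prints the bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{E9}";    # the character é (U+00E9)

my $latin1 = encode('ISO-8859-1', $char);   # one byte
my $utf8   = encode('UTF-8',      $char);   # two bytes

printf "Latin-1: %s\n", unpack('H*', $latin1);
printf "UTF-8:   %s\n", unpack('H*', $utf8);
```

Only the ASCII range (0x00 to 0x7F) is byte-for-byte identical in the two encodings, which is why a file can appear to work until the first accented character shows up.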

Re: Encoding horridness
by runrig (Abbot) on Jul 12, 2017 at 17:58 UTC
    You don't mention, but you have successfully parsed the file that you're generating with XML::LibXML, right?
      No, I haven't, but based on your suggestion and on others' comments (thanks to all), I will now do that. Incidentally, to answer questions from others, some of the data in the file comes from a MySQL database and, having checked, some of the fields are in UTF and some are latin1, so maybe that is the problem (although I believe you are right: my service provider should give more feedback, and I am going to badger them to do this). Other values just come from the script itself. I read that PERL internally uses UTF-8 format. So doesn't that mean that all data values, unless sourced directly from the database, are UTF-8 and therefore my latin1 encoded XML should never have worked? Or is it just that I was probably lucky, as latin1 is 'almost' a subset of UTF-8?
        I read that PERL internally uses UTF-8 format.

        Where did you read that? Certainly not from perlunitut which says (my emphasis):

        Perl has an internal format, an encoding that it uses to encode text strings so it can store them in memory. All text strings are in this internal format. In fact, text strings are never in any other format!

        You shouldn't worry about what this format is, because conversion is automatically done when you decode or encode.

Re: Encoding horridness
by karlgoethebier (Prior) on Jul 12, 2017 at 17:24 UTC

    Not sure. Hence I don't reply to the OP. I'll praise the lord if I ever fully understand this encoding stuff.

    Shouldn't something like use open IN  => ":encoding(iso-8859-1)", OUT => ':utf8'; do the job?
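For the file-to-file case, the pragma can be tried out with a sketch like the one below. The temporary files and sample bytes are made up for the demonstration, and ':encoding(UTF-8)' is used for the output layer instead of ':utf8' to match the rest of the thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a hypothetical Latin-1 input file containing "café\n".
my ($tmp_fh, $in_path) = tempfile();
binmode $tmp_fh;
print {$tmp_fh} "caf\xE9\n";
close $tmp_fh or die "Cannot close $in_path: $!";

my (undef, $out_path) = tempfile();

{
    # Default layers for every open() in this lexical scope
    # that does not name its own layers.
    use open IN => ':encoding(iso-8859-1)', OUT => ':encoding(UTF-8)';

    open my $in,  '<', $in_path  or die "Cannot read $in_path: $!";
    open my $out, '>', $out_path or die "Cannot write $out_path: $!";
    print {$out} $_ while <$in>;
    close $out or die "Cannot close $out_path: $!";
    close $in;
}

# Read the result back as raw bytes to see what was actually written.
open my $raw, '<:raw', $out_path or die "Cannot read $out_path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

print unpack('H*', $bytes), "\n";
```

This only covers data that arrives through open(); string literals in the script and values fetched from a database are not affected by the pragma.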

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

      In principle yes, but data could also come from the script (fun) or a database (more fun) or a web page (incredible fun). Reading from a file is the easiest way to acquire data provided that the file only contains one encoding of characters.

        "In principle yes..."

        Just another question to Radio Yerevan.

        Thanks and best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

Re: Encoding horridness
by Anonymous Monk on Jul 12, 2017 at 14:06 UTC
    We don't know what your service provider considers "valid xml." Just for fun, what happens if you try this?
    use Encode qw( encode XMLCREF );
    print XML encode('ascii', $cureq, XMLCREF);
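What that fallback does can be seen on a small string. The sketch below is an illustration with a made-up sample value; it uses FB_XMLCREF, the variant of the flag that also leaves the source string untouched, and it assumes the input is already a decoded character string.

```perl
use strict;
use warnings;
use Encode qw(encode FB_XMLCREF);

# Hypothetical sample text containing two non-ASCII characters.
my $text = "caf\x{E9} costs 3 \x{20AC}";

# Anything that does not fit into ASCII becomes a numeric character reference.
my $ascii = encode('ascii', $text, FB_XMLCREF);

print "$ascii\n";
```

The result is pure ASCII, so it reads the same whatever encoding the XML declaration names. If the provider accepts such a file, the rejection is very likely about encoding and not about the XML structure.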
Re: Encoding horridness
by sundialsvc4 (Abbot) on Jul 13, 2017 at 00:44 UTC

    Probably the first thing to do would be to contact the service provider to see if they, say, within their log-files or whatnot, can give you additional details about why they rejected your file now.

    If, as has been hypothesized, the actual problem is differences in character encoding between Latin-1 and UTF-8 (and the actual presence of such characters in the data, of course), then you might have a difficult problem indeed.

    A search of CPAN shows a few interesting-looking modules, such as Encoding::FixLatin, which says that it “takes input which may contain characters in more than one encoding and makes a best effort to convert them all to UTF-8 output.”   Whether you could get away with passing your XML-file through such a thing, I know not, although it might be worth a try.   But, it would still just be a hack.

    A (much) more thorough solution would embrace XML::LibXML in all its glory.   This is what that library has to say about “Encodings Support in XML::LibXML”:

    Recall that since version 5.6.1, Perl distinguishes between character strings (internally encoded in UTF-8) and so called binary data and, accordingly, applies either character or byte semantics to them.   A scalar representing a character string is distinguished from a byte string by special flag (UTF8).   Please refer to perlunicode for details.

    XML::LibXML's API is designed to deal with many encodings of XML documents completely transparently, so that the application using XML::LibXML can be completely ignorant about the encoding of the XML documents it works with.   On the other hand, functions like XML::LibXML::Document->setEncoding give the user control over the document encoding.

    Unsurprisingly, XML::Twig has powerful support for encoding, as well.

    Even though this might mean rewriting the code that you already have, it might actually be worth considering doing just that.   If you are dealing with characters that require encoding, and that are presently encoded as Latin1 within your program and/or data, then your existing strategy might just have been shown to be insufficiently robust.   Such that it might well be justifiable to, so to speak, “bite the bullet (once ...) and fix it right.”

    “After all, this shoulda been very simple ... just change the output encoding from Latin1 to UTF8.”   And ... on the as-yet untested(!) presumption that this is the actual source of your woes right now, it properly should have been.

    - - -

    Incidentally, runrig’s recent recommendation to validate your file with, specifically, XML::LibXML, is a very prescient one, which unfortunately I could up-vote only one time.   Under the hood, this package leverages libxml2, which is an industry-standard binary library for handling “all things XML.”   It is extremely likely that your vendor is using this specific software (library) to process your file, no matter what programming-language they might be using to drive it.   Therefore, you might find that it will tell you what is wrong with your present file, perhaps before you undertake to rewrite your present application, if your vendor cannot be very-easily cajoled to do so.   This package is very comprehensive, very powerful, very “standard,” and yet, very easy to use.

    Anonymous Monk’s caution about the Perl meaning of use utf8 is also very, very important ... someone’s sure missing out on some well-deserved experience points right now.

    - - -

    Backspace ... backspace ... backspace ...   Even as I write this, my spidey-sense is going off, telling me that I am jumping to the conclusion that “Latin1 vs. UTF-8” ... and the presence of characters which are impacted by this ... is the actual reason why the vendor is rejecting the file.   But, do we know that?   Well, we don’t yet, but we sure do have the easy means with which to find out!   I think that what I would do next, right now is to cobble-up a little program that asks XML::LibXML (specifically ...) to have a look at the output-file that you’ve got right now.   Before doing anything else.   This should give you the detailed answer that you seek.   (libxml2 is an authority.)
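Such a little program could look roughly like this. It is a sketch under assumptions: XML::LibXML must be installed, the helper name xml_problem is invented for the example, and the two sample documents are made up to show one accepted and one rejected input.

```perl
use strict;
use warnings;
use XML::LibXML;

# Returns an empty string if libxml2 accepts the document,
# otherwise the parser's error message.
sub xml_problem {
    my ($xml_bytes) = @_;
    my $doc = eval { XML::LibXML->load_xml(string => $xml_bytes) };
    return defined $doc ? '' : "$@";
}

# Hypothetical samples: same UTF-8 declaration, different bytes for "é".
my $good = qq{<?xml version="1.0" encoding="UTF-8"?>\n<name>caf\xC3\xA9</name>\n};
my $bad  = qq{<?xml version="1.0" encoding="UTF-8"?>\n<name>caf\xE9</name>\n};

for my $sample ($good, $bad) {
    my $problem = xml_problem($sample);
    print $problem eq '' ? "well-formed\n" : "rejected: $problem\n";
}
```

To check the real output, load_xml(location => $xmlfile) parses the file directly. A declaration that says UTF-8 on top of bytes that are still Latin-1 is exactly the kind of mismatch this reports.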

Node Type: perlquestion [id://1194918]
Approved by marto