http://www.perlmonks.org?node_id=200072

ajt has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks, I have hit a Perl 5.6.x Unicode awkwardness.

In my application I was posting iso-8859-1 encoded XML to a SAP Business Connector Server. This was recently upgraded, and now it uses utf-8 encoding for both input and output.

So I removed all my iso-8859-1 to utf-8 encoding on the output stage, and used XML::LibXSLT to template it up, and all is well.

The input stage is more of a problem. The input data is iso-8859-1 encoded from browser form input. I've done all the magic I want on it to create an XML file, but the SAP BC server doesn't like it in this encoding, even with the correct encoding declaration. So I converted the data string to utf-8 with the encodeToUTF8 function in XML::LibXML. However, when LWP POSTs this to the BC Server, BC complains that the data is no longer valid XML.

What I think is happening is that the one-octet test character E9 becomes the two-octet pair C3 A9 in utf-8. When LWP calculates the length of the string, it seems to count characters, not octets, so LWP seems to truncate the file by one octet (in this example), producing invalid XML, and the BC Server dies. If I pad the POST with a bunch of spaces at the end, the file goes through okay, since trailing whitespace is ignored by BC's XML parser.
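
To illustrate the kind of mismatch I suspect, here is a minimal sketch of how the character count and the octet count of the same string can differ. It is not the real application code: the tiny XML string is made up, and it assumes a Perl with utf8::upgrade (5.8 or later) rather than my 5.6.x, so treat it only as a demonstration of the idea. Whether LWP really takes the character count is exactly what I am unsure about.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without switching on byte semantics

# Hypothetical test document: one e-acute (U+00E9) inside a small element.
my $xml = '<t>' . chr(0xE9) . '</t>';

# Force Perl's internal UTF-8 representation, as a UTF-8-flagged string
# would have (assumption: utf8::upgrade is available, i.e. Perl 5.8+).
utf8::upgrade($xml);

my $chars  = length($xml);          # counts characters
my $octets = bytes::length($xml);   # counts octets of the internal UTF-8 form

printf "characters: %d, octets: %d\n", $chars, $octets;
# prints "characters: 8, octets: 9"
```

If a Content-Length header were built from the first number while the body went out on the wire as the second, the receiver would see the document cut short by one octet, which matches the symptom above.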

Q1: Does this sound like a plausible explanation?

Q2: Can anyone think of a more elegant solution?

As ever humble thanks in advance....


--
ajt