Strange Unicode normalization question
by mje (Curate)
on Aug 15, 2018 at 17:37 UTC
mje has asked for the
wisdom of the Perl Monks concerning the following question:
We are using an API (which unfortunately I can't tell you much about) provided by another party, which uses POST over HTTPS. While reviewing code written by an ex-coworker I discovered a mysterious call to NFKD, which I now realise comes from Unicode::Normalize. I could not explain why it was there and tried taking it out, but removing it actually breaks things, and I'm hoping someone here might have some insights. The API involves POSTing a number of strings to an HTTPS URL, and the response contains one of 3 statuses (2 mean a match for the supplied data was found and 1 means a match was not found). The suppliers of the API provide some test data which is supposed to be UTF-8 encoded, and I have confirmed this in that I can a) find UTF-8 continuation bytes where there are accents, diacritics etc. and b) open the file with ':encoding(UTF-8)' and read it without errors.
The test code opens the test data file with ':encoding(UTF-8)', reads a line of strings, POSTs them to the URL and gets the response. It then checks that the response matches the expected response. When run with the URL-encoded POST data simply encoded as UTF-8 with "Content-Type" => 'application/x-www-form-urlencoded ; charset=UTF-8', some of the test data fails. When the data is URL-encoded and passed through NFKD, all of the tests pass. a) All of the failing tests contain strings which are non-ASCII, and b) it is obvious they are not matching because the status returned is a non-match when a match is expected. An example is Lubomír,Bartoňová. After passing through NFKD, the accent over the i is much larger.
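To make the difference concrete, here is a minimal sketch of what NFKD changes for a name like the one above. It assumes the normalization is applied to the decoded character string before percent-encoding. The `url_encode` helper is hypothetical, written only for this illustration, and is not the code that talks to the API.

```perl
use strict;
use warnings;
use utf8;                                   # this source contains UTF-8 literals
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFKD);

# Hypothetical helper: encode to UTF-8 bytes, then percent-encode every
# byte outside the RFC 3986 unreserved set.
sub url_encode {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

my $name = "Lubomír";

# Composed form: the accented letter stays a single code point, U+00ED.
my $composed = url_encode(NFC($name));

# Decomposed form: the letter becomes "i" followed by U+0301 COMBINING ACUTE ACCENT.
my $decomposed = url_encode(NFKD($name));

print "NFC:  $composed\n";     # Lubom%C3%ADr
print "NFKD: $decomposed\n";   # Lubomi%CC%81r
```

The two forms look the same when rendered but are different byte sequences on the wire, so a server that compares bytes without normalizing would treat them as different strings.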
The actual code is even stranger as it does this to the url-encoded strings ($html is the url-encoded strings)
but I have no evidence of NonspacingMark ever being in the normalized string.
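For reference, one way to look for nonspacing marks is to count matches of `\p{Mn}` (general category Nonspacing_Mark). The sketch below is illustrative only: the variable names are hypothetical, and the percent-encoded literal is built by hand from the example name rather than taken from a real request.

```perl
use strict;
use warnings;
use utf8;                                   # this source contains UTF-8 literals
use Unicode::Normalize qw(NFKD);

# Decoded character string, as read from the test data file.
my $decoded = "Bartoňová";

# Hand-built percent-encoded form of the same name (plain ASCII).
my $encoded = "Barto%C5%88ov%C3%A1";

# Count nonspacing marks (\p{Mn}) present after NFKD in each case.
my $marks_in_decoded = () = NFKD($decoded) =~ /\p{Mn}/g;
my $marks_in_encoded = () = NFKD($encoded) =~ /\p{Mn}/g;

print "decoded then NFKD: $marks_in_decoded nonspacing mark(s)\n";   # 2
print "encoded then NFKD: $marks_in_encoded nonspacing mark(s)\n";   # 0
```

A percent-encoded string contains only ASCII characters, so in this sketch NFKD has nothing to decompose and no nonspacing marks appear in it.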
It seems unlikely the API provider supplied test data which does not match their dataset so that leaves me wondering a) what might be going wrong and b) how the hell did my ex-colleague discover this - it feels like a bodge.
I would greatly appreciate any possible insights from monks here.