
We are using an API (which unfortunately I can't tell you much about) provided by another party, which uses POST over HTTPS. While reviewing code written by an ex-coworker I discovered a mysterious call to NFKD, which I now realise comes from Unicode::Normalize. I could not explain why it was there and tried taking it out, but removing it actually breaks things, so I'm hoping someone here might have some insights.

The API involves POSTing a number of strings to an HTTPS url, and the response contains one of 3 statuses (2 mean a match for the supplied data was found and 1 means a match was not found). The suppliers of the API provide some test data which is supposed to be UTF-8 encoded. I have confirmed this in two ways: a) I can find UTF-8 continuation bytes where there are accents/diacritics etc. and b) I can open the file with ':encoding(UTF-8)' and it is read without errors.

The test code opens the test data file with ':encoding(UTF-8)', reads a line of strings, POSTs them to the url and gets the response. It then checks that the response matches the expected response. When run with the url-encoded POST data simply encoded as UTF-8 and sent with "Content-Type" => 'application/x-www-form-urlencoded ; charset=UTF-8', some of the test data fails. When the data is url-encoded and passed through NFKD, all of the tests pass.

Two things stand out: a) all of the failing tests contain strings which are non-ASCII and b) it is obvious they are not matching, because the status comes back as a non-match when a match is expected. An example is Lubomír,Bartoňová. After passing through NFKD, the accent over the i is much larger.
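To make the difference concrete, here is a minimal sketch of what NFKD does to the first example name and how that changes the bytes that end up in the POST body. It is not the production code: the helper names percent_encode and code_points are made up for illustration, and the percent-encoding is hand-rolled here (escaping everything except RFC 3986 unreserved characters) just to keep the example core-only.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFC NFKD);

# Hypothetical helper: UTF-8 encode a character string, then
# percent-encode every byte that is not an RFC 3986 unreserved character.
sub percent_encode {
    my ($chars) = @_;
    my $bytes = encode('UTF-8', $chars);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($chars) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $chars;
}

# "Lubomír" written with an escape so the source file encoding does not matter.
my $name = "Lubom\x{ED}r";

for my $form ( [ NFC => NFC($name) ], [ NFKD => NFKD($name) ] ) {
    my ($label, $str) = @$form;
    printf "%-4s length=%d  %s\n", $label, length $str, percent_encode($str);
    printf "     %s\n", code_points($str);
}

# NFC  length=7  Lubom%C3%ADr
#      U+004C U+0075 U+0062 U+006F U+006D U+00ED U+0072
# NFKD length=8  Lubomi%CC%81r
#      U+004C U+0075 U+0062 U+006F U+006D U+0069 U+0301 U+0072
```

So the precomposed í (U+00ED) becomes a plain i followed by a combining acute accent (U+0301), and the two forms are different byte sequences on the wire even though they look like the same name on screen.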

The actual code is even stranger, as it does this to the url-encoded strings ($html holds the url-encoded strings):

my $decomposedHtml = NFKD( $html );
$decomposedHtml =~ s/\p{NonspacingMark}//g;

but I have no evidence of NonspacingMark ever being in the normalized string.
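One way to gather that evidence is to count the marks before and after each step. The following is only a sketch: count_marks and decompose_and_strip are hypothetical helpers, the sample values are made up from the example name, and it assumes nothing about what $html really contains at that point. It does show one thing that may be relevant: if the string is already fully percent-encoded it is pure ASCII, so neither NFKD nor the substitution can change it.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKD);

# Hypothetical helper: count nonspacing marks (general category Mn).
sub count_marks {
    my ($chars) = @_;
    my $n = () = $chars =~ /\p{NonspacingMark}/g;
    return $n;
}

# The same two steps as the statements above: decompose, then delete the marks.
sub decompose_and_strip {
    my ($chars) = @_;
    my $decomposed = NFKD($chars);
    $decomposed =~ s/\p{NonspacingMark}//g;
    return $decomposed;
}

my $raw     = "Lubom\x{ED}r";   # sample: decoded characters, precomposed i-acute
my $escaped = 'Lubom%C3%ADr';   # sample: the same name, already percent-encoded

printf "marks in raw string:  %d\n", count_marks($raw);           # 0
printf "marks after NFKD:     %d\n", count_marks(NFKD($raw));     # 1
printf "raw, stripped:        %s\n", decompose_and_strip($raw);     # Lubomir
printf "escaped, stripped:    %s\n", decompose_and_strip($escaped); # Lubom%C3%ADr
```

Dumping the real $html through something like count_marks right before and right after the NFKD call would show whether the marks are ever present there, and whether what reaches the server is the decomposed name or the accent-free one.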

It seems unlikely the API provider supplied test data which does not match their dataset so that leaves me wondering a) what might be going wrong and b) how the hell did my ex-colleague discover this - it feels like a bodge.

I would greatly appreciate any possible insights from monks here.

In reply to Strange Unicode normalization question by mje
