Converting entities in JSON context

by Anonymous Monk
on May 19, 2022 at 11:45 UTC ( #11144000=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm struggling with something that I thought would be very simple. I have a legacy system which sends data in JSON. The underlying data, which I can't change, uses HTML entities. I need to convert this to UTF-8, because a receiving system can't handle the entities. I wrote a one-line test for this, which is failing, and I don't know why.

When I do the conversion on the text itself, it looks fine. When I do the conversion on the JSON, it also looks fine, but when I decode the JSON for the test, the special characters in the decoded values come out wrong. A simple test case:

#!/usr/bin/env perl
use strict;
use warnings;
use HTML::Entities;
use Encode;
use JSON::MaybeXS;

my $original_string  = "E&#x00F6;tv&#x00F6;s Lor&#x00E1;nd University";
my $converted_string = encode_utf8( decode_entities($original_string) );
print "Original string: [$original_string]\n";    # shows the entities
print "Converted string: [$converted_string]\n";  # shows the special characters

my $entities_json  = '{"school":"E&#x00F6;tv&#x00F6;s Lor&#x00E1;nd University"}';
my $converted_json = encode_utf8(decode_entities($entities_json));
print "Original JSON: [$entities_json]\n";    # shows the entities
print "Converted JSON: [$converted_json]\n";  # looks right: shows the special characters

my $decoded_json = decode_json($converted_json);
print "School: " . $decoded_json->{'school'} . "\n";
# should be "Eötvös Loránd University" but is actually "E�tv�s Lor�nd University",
# with the special characters messed up (N.B. Perlmonks is showing this incorrectly as well)

What is going on here? And, how am I supposed to convert my JSON-with-entities to something, well, correct?

Replies are listed 'Best First'.
Re: Converting entities in JSON context
by choroba (Archbishop) on May 19, 2022 at 12:48 UTC
    You can fix the output by enabling the correct IO layer for the output:
    binmode *STDOUT, ':encoding(UTF-8)';

    Perl JSON modules keep the strings in parsed structures as characters, but when serializing to JSON strings, they use bytes and UTF-8 encoding.
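
To make the characters-versus-bytes distinction concrete, here is a minimal sketch. It assumes the core module JSON::PP as a stand-in for JSON::MaybeXS (both provide decode_json and encode_json with the same calling convention), and the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use JSON::PP qw(decode_json encode_json);

# UTF-8 encoded JSON bytes: each "ö" is the two bytes 0xC3 0xB6.
my $json_bytes = qq({"school":"E\xC3\xB6tv\xC3\xB6s"});

# decode_json expects UTF-8 bytes and returns character strings.
my $data   = decode_json($json_bytes);
my $school = $data->{school};

printf "characters: %d\n", length $school;                  # 6 characters
printf "second char: U+%04X\n", ord substr($school, 1, 1);  # U+00F6

# Encoding the character string back to UTF-8 gives bytes again.
my $school_bytes = encode_utf8($school);
printf "bytes: %d\n", length $school_bytes;                 # 8 bytes

# encode_json also returns UTF-8 encoded bytes.
my $round_trip = encode_json({ school => $school });
print "round trip matches input bytes\n" if $round_trip eq $json_bytes;
```

Printing $school directly, without an encoding layer on STDOUT, sends the single byte 0xF6 for each "ö", which a UTF-8 terminal cannot display. That is why the binmode line above fixes the output.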


    According to the specification, JSON doesn't use entities, but it can use the \uXXXX notation, so instead of using HTML::Entities, you can try

    sub convert {
        my ($s) = @_;
        $s =~ s/&#x([[:xdigit:]]{4});/\\u$1/gr
    }

    my $entities_json  = '{"school":"E&#x00F6;tv&#x00F6;s Lor&#x00E1;nd University"}';
    my $converted_json = convert($entities_json);
    print "Original JSON: [$entities_json]\n";
    print "Converted JSON: [$converted_json]\n";  # [{"school":"E\u00F6tv\u00F6s Lor\u00E1nd University"}]

    my $decoded_json = decode_json($converted_json);
    binmode STDOUT, ':encoding(UTF-8)';
    print "School: " . $decoded_json->{'school'} . "\n";  # Eötvös Loránd University
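
The substitution above matches references with exactly four hex digits, so a shorter form such as &#xF6; would pass through unchanged. If the legacy data might contain those, a variant along the following lines could pad them. This is only a sketch: entities_to_json_escapes is a hypothetical name, JSON::PP stands in for JSON::MaybeXS, and named entities, decimal references, and code points above U+FFFF are not handled.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Hypothetical helper: rewrite hex character references with 1 to 4 hex
# digits (Basic Multilingual Plane only) as zero-padded JSON \uXXXX escapes.
sub entities_to_json_escapes {
    my ($s) = @_;
    return $s =~ s/&#x([[:xdigit:]]{1,4});/sprintf '\\u%04X', hex $1/ger;
}

my $json    = entities_to_json_escapes('{"quote":"say &#x22;hi&#x22;","school":"E&#xF6;tv&#xF6;s"}');
my $decoded = decode_json($json);

binmode STDOUT, ':encoding(UTF-8)';
print "$json\n";               # the JSON text stays pure ASCII
print "$decoded->{quote}\n";
print "$decoded->{school}\n";
```

One possible advantage of rewriting to \uXXXX rather than decoding entities in the raw JSON text: a reference that stands for a double quote or a backslash becomes an escaped character, whereas decoding it in place would insert a bare quote or backslash into the JSON and could make it unparseable.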

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]

      I'm not sure I understand. It doesn't seem to be about the display only. If I write a test for this (which is where I originally discovered this situation), something like:

      is(
          $decoded_JSON->{'school'},
          "Eötvös Loránd University",
          "convert_entities correctly converted HTML entities in a JSON context, and yielded good JSON at the end"
      );

      , this test fails. What should I be expecting from the test, or how do I write a test to make sure that the JSON I'm sending is what I should be sending?

        Here, you're using non-ASCII characters in the source code (the second argument to is).

        To tell Perl how to interpret them, you need to

        use utf8;

        This makes Perl interpret the part of the source in the lexical scope of the pragma as UTF-8 encoded. (Update: and you need to save the source file as UTF-8, too, of course.)
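
A small sketch of the effect, assuming the file is saved as UTF-8 and using the core module JSON::PP in place of JSON::MaybeXS. The comparison is a plain eq here; the same reasoning would apply to the second argument of is.

```perl
use strict;
use warnings;
use utf8;    # string literals in this file are decoded as UTF-8 text
use JSON::PP qw(decode_json);

# \uXXXX escapes keep the JSON text itself pure ASCII.
my $json    = '{"school":"E\u00F6tv\u00F6s Lor\u00E1nd University"}';
my $decoded = decode_json($json);

# With "use utf8" this literal holds 24 characters. Without the pragma it
# would hold the 27 raw bytes of the file and the comparison would fail.
my $expected = "Eötvös Loránd University";

binmode STDOUT, ':encoding(UTF-8)';
printf "expected length: %d\n", length $expected;
print $decoded->{school} eq $expected ? "match\n" : "mismatch\n";
```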

Re: Converting entities in JSON context
by Don Coyote (Hermit) on May 19, 2022 at 12:57 UTC

    Does *?Module?::decode_json() do what you think it is doing?

    JSON::MaybeXS does not define decode_json itself. It picks a backend module (Cpanel::JSON::XS, JSON::XS, or JSON::PP) and uses the decode_json sub from that module.

Node Type: perlquestion [id://11144000]
Approved by davies
Front-paged by Corion