Since JSON is (supposed to be) UTF-8, you merely need to mark the resulting data as being UTF-8 decoded. You could even do that for all string data, assuming all your input has been verified as UTF-8. See for example Re: Bypass utf-8 encoding/decoding?; the function/macro you want is newSVpvn_utf8.
Obviously, this implies that you're trusting your input data to actually be valid UTF-8...
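At the script level, utf8::decode is roughly the analogue of that XS-level flag flip: it validates the bytes as UTF-8 and, on success, marks the scalar as character data. A minimal sketch (the sample string is my own, not from the thread):

```perl
use strict;
use warnings;

# utf8::decode() validates the scalar's bytes as UTF-8 and, on success,
# flips it to character data in place. It returns false on invalid UTF-8,
# so it doubles as a validity check.
my $bytes = "caf\xC3\xA9";        # "café" encoded as UTF-8 bytes (5 bytes)
if (utf8::decode($bytes)) {
    # $bytes is now a 4-character string; the last character is U+00E9
    printf "decoded ok, length=%d\n", length $bytes;
}
else {
    warn "input was not valid UTF-8\n";
}
```

Note that this mutates the scalar in place, so take a copy first if you still need the raw bytes.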
| [reply] [d/l] |
newSVpvn_utf8 sounds awesome! Is there some simple way to detect invalid UTF-8?
I guess something, somewhere, knows this, since croak() is the bane of my existence right now: email subject lines which may or may not have been truncated somewhere are 100% guaranteed to spew invalid UTF-8 at *some* point.
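One way to detect invalid UTF-8 without letting croak() kill you is Encode::decode with the FB_CROAK check, wrapped in an eval; the helper name here is hypothetical:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# decode() with FB_CROAK dies on the first malformed sequence, so a
# truncated multi-byte character is caught instead of silently mangled.
# Hypothetical wrapper: returns characters on success, undef on bad input.
sub decode_utf8_strict {
    my ($bytes) = @_;
    my $chars = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return defined $chars ? $chars : undef;
}

# "é" is 0xC3 0xA9 in UTF-8; dropping the last byte truncates it
print defined decode_utf8_strict("caf\xC3\xA9") ? "valid\n"   : "invalid\n";
print defined decode_utf8_strict("caf\xC3")     ? "valid\n"   : "invalid\n";
```

That second call is exactly the truncated-subject-line case: the trailing 0xC3 starts a two-byte sequence that never finishes, and FB_CROAK flags it instead of emitting garbage.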
Is there some way perl can auto-magically handle UTF-16 as well? e.g. (from the RFC): "... UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may be represented as "\uD834\uDD1E"." (that one character, 4 bytes in UTF-8 but 12 characters of JSON escape text, is also an example of why truncated text breaks everything I expect)
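For what it's worth, the surrogate-pair arithmetic itself is tiny; a sketch with a hypothetical helper (real JSON parsers do this while lexing the string):

```perl
use strict;
use warnings;

# Combine a JSON \uXXXX\uXXXX surrogate pair into one code point:
#   code point = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
# (hypothetical helper, not from any particular JSON module)
sub surrogate_pair_to_char {
    my ($hi, $lo) = @_;
    die "not a surrogate pair\n"
        unless $hi >= 0xD800 && $hi <= 0xDBFF
            && $lo >= 0xDC00 && $lo <= 0xDFFF;
    return chr(0x10000 + (($hi - 0xD800) << 10) + ($lo - 0xDC00));
}

my $clef = surrogate_pair_to_char(0xD834, 0xDD1E);
printf "U+%04X\n", ord $clef;   # prints "U+1D11E", the G clef
```

The truncation hazard is visible here too: a lone \uD834 with its low surrogate chopped off fails the range check, which is why a parser has to treat an unpaired surrogate as an error.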
| [reply] |
See Encode::Unicode for the translations between the various Unicode encodings.
I think converting from UTF-16 to UTF-8 is merely a mathematical transformation between two encoding styles of the same number, so you can easily model that. Note that JSON's backslash escapes are always UTF-16 code units: a \uXXXX value outside the surrogate range stands alone, while a high surrogate (\uD800-\uDBFF) must be combined with the low surrogate that follows it.
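With Encode::Unicode loaded (Encode pulls it in for the UTF-16 codecs), the transcode is two calls; a sketch assuming big-endian UTF-16 bytes without a BOM:

```perl
use strict;
use warnings;
use Encode qw(encode decode);   # Encode::Unicode supplies the UTF-16 codecs

# U+1D11E as UTF-16BE is the surrogate pair D8 34 DD 1E
my $utf16be = "\xD8\x34\xDD\x1E";

# Decode to characters, then re-encode: same code point, two encodings
my $chars = decode('UTF-16BE', $utf16be);   # 1 character, U+1D11E
my $utf8  = encode('UTF-8', $chars);        # 4 bytes: F0 9D 84 9E

printf "U+%X -> %s\n", ord $chars,
    join ' ', map { sprintf '%02X', ord } split //, $utf8;
```

If your input might carry a BOM, the plain 'UTF-16' codec will consume it and pick the byte order for you.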
| [reply] |
bool json_parse_more(pTHX_
struct json_parse_state *state, // configuration and error messages
SV *input, // any scalar
int input_pos, // byte offset within the scalar
SV *output // empty SV, destination for data
);
and you could call that recursively, filling the output SV with the progress-so-far of whatever it found on input. As long as the state struct is unique to the thread, the function is thread-safe. It's probably easiest to store all the error info in the struct.
You could probably read the implementations of all the other JSON modules to flesh out the implementation of that one function, then you could wrap that one function in XS, along with some XS methods to construct/read/write the state struct, and you'd be on your way.
When you get to the part about decoding Unicode, you'll see the solutions in all the other JSON modules, but you need to fully understand what they're solving. A Perl SV can hold either raw bytes or characters, and the internal is_utf8 flag is *not* a proper indication of which. The is_utf8 flag only tells the back-end whether it needs to use utf8 functions to read the characters, or whether there is one character per byte. There are cases where a byte > 127 is stored as a UTF-8 sequence even though the application never intended it to be a character yet.
So, you need to let the user specify whether they think their string contains bytes or characters when calling your API, then do the decoding in your module if they say the input needs decoding. Again, the solutions for these problems can all be found in the other existing JSON modules. As it happens, the UTF8 rant by MLEHMANN in the JSON::XS manual is the explanation that finally showed me the right way to think about Perl's utf8 flag.
| [reply] [d/l] [select] |
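A tiny demonstration of the flag behavior described above: upgrading a scalar changes its internal storage (and the is_utf8 flag) without changing the string Perl sees, which is exactly why the flag cannot mean "this data was decoded".

```perl
use strict;
use warnings;

# Two scalars holding the same one-"character" string 0xE9, stored differently:
my $native   = "\xE9";          # one byte, UTF8 flag off
my $upgraded = "\xE9";
utf8::upgrade($upgraded);       # same string, now stored as UTF-8 internally

# The flags differ, yet Perl considers the strings identical:
printf "flags: %d vs %d, equal: %d\n",
    utf8::is_utf8($native)   ? 1 : 0,
    utf8::is_utf8($upgraded) ? 1 : 0,
    $native eq $upgraded     ? 1 : 0;   # prints "flags: 0 vs 1, equal: 1"
```

Whether that 0xE9 is "é" or just a raw byte is something only the application knows, which is why the API has to ask the caller.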