
Re: Encoding: my custom encoding fails on one character but works for everything else?!

by graff (Chancellor)
on Sep 14, 2009 at 06:17 UTC ( #795064 )

in reply to Encoding: my custom encoding fails on one character but works for everything else?!

That's really curious. Is it the case that the "dotless i" / "{i}" is the only point where your "Daud" encoding uses a single character between curlies?

I tried to replicate your situation, just doing the accented "A" characters from latin1 along with adding the dotless i, and I got the same results you got -- trying to decode from "{i}" to "\x{0131}" gave me an empty string, while everything else worked as expected.

I noticed that if the test string for decoding was "{i} " (note the space after the close-curly), it worked just fine (and didn't lose the space, either; any other character in that position would work as well).

Then I added one other code point using one character between curlies to see if that would behave the same way -- inverted q-mark / "{?}" -- and when this was in the ucm file, both the q-mark and the dotless i worked fine without further ado (no extra character needed in the test string).

So, I can't explain it (maybe some other monk can), but see if that works for you:

<code_set_name> "daud"
<mb_cur_min> 1
<mb_cur_max> 4
<subchar> \x3F
CHARMAP
<U0000> \x00 |0 # NULL
...
#<U007B> \x7B |0 # LEFT CURLY BRACKET
<U007C> \x7C |0 # VERTICAL LINE
#<U007D> \x7D |0 # RIGHT CURLY BRACKET
...
# I included the next line, defining "{?}":
<U00BF> \x7b\x3f\x7d |0 # INVERTED QUESTION MARK
<U00C0> \x7b\x41\x60\x7d |0 # LATIN CAPITAL LETTER A WITH GRAVE
...
<U0131> \x7b\x69\x7d |0 # LATIN SMALL LETTER DOTLESS I
END CHARMAP

UPDATE: Regarding the issue of commenting out the curlies in the ucm file (U007b, U007d), this actually seems to me like a Good Idea™ in its own right. If some poor typist, trying to keyboard text using Daud encoding, happens to put curlies around a character or digraph that is not defined in your ucm file, a decode from that into unicode will yield "\x{fffd}..\x{fffd}" because those particular curlies cannot be decoded (and whatever was between them will be left unchanged). It's just good to know for sure how to identify errors of this kind.


Re^2: Encoding: my custom encoding fails on one character but works for everything else?!
by herveus (Parson) on Sep 14, 2009 at 11:55 UTC

    Thanks for those findings! There is one other potential case where a single character could appear between {}, but at present {i} is the only one I'm using.

    I've long had a module that handled this conversion, but it was a standalone thing, not integrated into the Encode system. A friend wanted to be able to systematically recode file and directory names between UTF-8 and Daud. He found 'convmv' which turns out to be a Perl program that uses Encode to do the heavy lifting. That motivated me to once again try to make Daud work in the Encode framework, leading here.

      I tried a bunch of other variations on the same theme. I've only glanced at the C code created by enc2xs (and at how it differs across variants of the ucm file), and I still don't understand exactly what goes wrong with this process. But based on what I've seen, the symptom appears to be something like this:

      For any mapping that includes strings of 3 or more "non-unicode" bytes equating to unicode code points, the arrangement of definitions in the ucm file should have at least one of the 3-byte strings before any of the longer string definitions. That is:

      ...
      <Uhhhh> \xhh\xhh\xhh |0     # first mention of any multi-byte mapping
      ...
      <Uhhhh> \xhh\xhh\xhh\xhh |0
      ...
      The above arrangement will work, whereas the arrangement below will have the problem as described in the OP:
      ...
      <Uhhhh> \xhh\xhh\xhh\xhh |0 # first mention of any multi-byte mapping
      ...
      <Uhhhh> \xhh\xhh\xhh |0
      ...
      (updated to remove spurious spaces)

      Bear in mind that the ucm definitions don't need to be in unicode code-point order -- enc2xs doesn't care about the ordering (except when it comes to triggering this one strange little bug).

      It probably applies to mappings involving 2-byte strings as well (that would seem logical), but I haven't tested it. In a nutshell: order your definitions by the length of the encoded byte strings, or at least put one instance of a 3-byte mapping ahead of all instances of 4-byte mappings.

      (So it was just "coincidental" that I chose q-mark for the initial attempt in my first reply -- it only worked because it just happened to be conventionally placed above the accented letters.)


        Hmmm... Thank you for that insight. I'll look into that nuance tonight as well.

