herveus has asked for the wisdom of the Perl Monks concerning the following question:


I have a specialized character encoding scheme that I am trying to get working with Encode. I'm so close, and yet not quite there. The encoding is known as "Daud" by its users.

The encoding scheme

The objective is to represent accented characters (primarily Latin-1, but also from a number of other code pages) in ASCII in a lossless manner. Non-ASCII characters are encoded as a one- or two-character string within braces. The characters (two in nearly every case) have mnemonic value.

For example, 'LATIN CAPITAL LETTER A WITH CIRCUMFLEX', U+00C2, is encoded as {A^}. The encoding is typically the base letter plus a modifier. Just to be different, 'LATIN SMALL LETTER DOTLESS I', U+0131, is encoded as {i}. Hold that thought.
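To make the scheme concrete, here is a minimal standalone sketch of the brace-escape idea, independent of the Encode framework. The hash name %daud_to_uni and the helper daud_decode are hypothetical illustrations, not part of the actual module:

```perl
use strict;
use warnings;

# Hypothetical lookup table: brace-delimited escapes map to code points.
my %daud_to_uni = (
    '{A^}' => "\x{C2}",   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    '{i}'  => "\x{131}",  # LATIN SMALL LETTER DOTLESS I
);

sub daud_decode {
    my ($s) = @_;
    # Replace each {x} or {xy} escape with its Unicode character;
    # unknown escapes are left untouched.
    $s =~ s/(\{[^{}]{1,2}\})/exists $daud_to_uni{$1} ? $daud_to_uni{$1} : $1/ge;
    return $s;
}
```

The real conversion goes through Encode's compiled tables rather than a hash, but the mapping being expressed is the same.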

The lone problem

{i} does not get decoded into U+0131 properly, although the conversion in the other direction works just fine.

What I did

I generated a .ucm file. I started with 8859-1 and hacked. The file looks like:

    <code_set_name> "daud"
    <mb_cur_min> 1
    <mb_cur_max> 4
    <subchar> \x3F
    CHARMAP
    ...
    #<U007B> \x7B |0 # LEFT CURLY BRACKET
    <U007C> \x7C |0 # VERTICAL LINE
    #<U007D> \x7D |0 # RIGHT CURLY BRACKET
    ...
    <U00C2> \x7B\x41\x5E\x7D |0 # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    ...
    <U0131> \x7B\x69\x7D |0 # LATIN SMALL LETTER DOTLESS I
    ...
    END CHARMAP

I found that I had to comment out at least the left curly bracket to make this work at all. That's fine, as it's not an independently valid character in text in this encoding.

I wrote a test file that exercises each and every character, converting the Unicode character into the Daud equivalent, and then doing a round-trip Unicode -> Daud -> Unicode. The test file passes 665/666. The only test case that is failing is the round trip for {i}.

my $string = decode("daud", encode("daud", $tests{$name}->{unicode}));
leaves $string empty.
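For reference, the round-trip pattern the test file relies on looks like the sketch below. Since the custom "daud" encoding is not installed here, the built-in iso-8859-1 encoding stands in for it; the structure of the check is the same:

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Test::More tests => 1;

# Round-trip sketch: Unicode -> octets -> Unicode should be lossless.
# iso-8859-1 is used as a stand-in for the custom "daud" encoding.
my $uni   = "\x{E9}";                       # LATIN SMALL LETTER E WITH ACUTE
my $bytes = encode("iso-8859-1", $uni);     # Unicode string -> octets
my $back  = decode("iso-8859-1", $bytes);   # octets -> Unicode string
is($back, $uni, "round trip preserves the character");
```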

What else have I observed?

I used enc2xs to convert the UCM file into Daud_t.c. My examination of that C file leaves me even more puzzled. I can see how the data structures there appear to correctly map {i} to a Unicode character.

Does anyone have any useful insights or dope-slaps for the obvious thing I'm missing? Is there other information I should have provided that I held back because I didn't want to make a total dump of the stuff?

Update 24 hours later

Thank you graff. Moving the dotless i line higher in the .ucm file did the trick. I was also able to keep the tests passing when I uncommented the RIGHT CURLY BRACE, but the LEFT CURLY BRACE needed to stay out of circulation. So far, at least.

Encode comes with some utilities, including a sort utility, but that did not reorder things into the working order, so that was a bust.

By the time I got to writing a SOPW, I had run out of creative ideas. Now to move forward with the companion encodings that do lossy conversions to Latin-1 and ASCII. And play around with some fallback conversions. Wheee!


Replies are listed 'Best First'.
Re: Encoding: my custom encoding fails on one character but works for everything else?!
by ELISHEVA (Prior) on Sep 13, 2009 at 23:56 UTC

    When in doubt, decompose.

    Have you tried dumping the data structure storing the result of parsing the ucm file? Does it look right? If you are storing conversion rules in a hash, what happens if you look up '{i}' (or 'i') independently in the hash? Do you get the right Unicode?

    What key do you need to look up the Unicode for '{i}'? If the data structures are storing the right information, what about the code that is supposed to be converting '{i}' into a form suitable for look-up in your encoding data structures? How is '{i}' being extracted? What happens if you separate out the code that extracts '{i}' and run it separately? Do you get a key suitable for look-up? Or do you get '{i' or 'i}' when in fact you need '{i}' (or vice versa)?

    I'd also investigate further why it was necessary to comment out the mapping for the left curly brace. Are right curlies also a problem? If not, what is special about left curlies? Is this a hint about how the encoding/decoding algorithm works? Perhaps there is something deeper that connects the problems with left curlies and '{i}'?

    Best, beth


      Daud_t.c contains a whole mess of declarations of constants. Each constant name includes the to and from encodings and the subsequence of bytes matched. I was able to see the sequence for {i} in there, and it appeared to be correct. The conversion process seems to be to chase pointers through a tangled mess ultimately leading to an output character. Or something along those lines. I'm doing a poor job of describing it, but I did not see anything out of line there.

      I empirically determined that I needed to comment out LEFT CURLY BRACKET. My surmise is that if it were legal in its own right, it would get consumed early, spoiling the matches... waitaminnit... (thinking as I type) I'll try moving that to the end to see if that allows it to catch otherwise-unmatched {.

      In the Daud encoding, curlies are the brackets around non-ASCII characters. They are, thus, special in this context. Hmmm...

      Thanks for the thoughts.

Re: Encoding: my custom encoding fails on one character but works for everything else?!
by graff (Chancellor) on Sep 14, 2009 at 06:17 UTC
    That's really curious. Is it the case that the "dotless i" / "{i}" is the only point where your "Daud" encoding uses a single character between curlies?

    I tried to replicate your situation, just doing the accented "A" characters from latin1 along with adding the dotless i, and I got the same results you got -- trying to decode from "{i}" to "\x{0131}" gave me an empty string, while everything else worked as expected.

    I noticed that if the test string for decoding was "{i} " (note the space after the close-curly), it worked just fine (and didn't lose the space, either; any other character in that position would work as well).

    Then I added one other code point using one character between curlies to see if that would behave the same way -- inverted q-mark / "{?}" -- and when this was in the ucm file, both the q-mark and the dotless i worked fine without further ado (no extra character needed in the test string).

    So, I can't explain it (maybe some other monk can), but see if that works for you:

    <code_set_name> "daud"
    <mb_cur_min> 1
    <mb_cur_max> 4
    <subchar> \x3F
    CHARMAP
    <U0000> \x00 |0 # NULL
    ...
    #<U007B> \x7B |0 # LEFT CURLY BRACKET
    <U007C> \x7C |0 # VERTICAL LINE
    #<U007D> \x7D |0 # RIGHT CURLY BRACKET
    ...
    # I included the next line, defining "{?}":
    <U00BF> \x7b\x3f\x7d |0 # INVERTED QUESTION MARK
    <U00C0> \x7b\x41\x60\x7d |0 # LATIN CAPITAL LETTER A WITH GRAVE
    ...
    <U0131> \x7b\x69\x7d |0 # LATIN SMALL LETTER DOTLESS I
    END CHARMAP

    UPDATE: Regarding the issue of commenting out the curlies in the ucm file (U007b, U007d), this actually seems to me like a Good Idea™ in its own right. If some poor typist, trying to keyboard text using Daud encoding, happens to put curlies around a character or digraph that is not defined in your ucm file, a decode from that into unicode will yield "\x{fffd}..\x{fffd}" because those particular curlies cannot be decoded (and whatever was between them will be left unchanged). It's just good to know for sure how to identify errors of this kind.
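The \x{fffd} behavior mentioned above is Encode's default: with the default CHECK value, bytes that cannot be decoded are replaced with U+FFFD (REPLACEMENT CHARACTER), so scanning for that character is one way to flag mistyped escapes. A small sketch, using the built-in ascii encoding to provoke the substitution:

```perl
use strict;
use warnings;
use Encode qw(decode);

# With the default CHECK value, an undecodable byte comes back as
# U+FFFD (REPLACEMENT CHARACTER); \xE9 is not valid ASCII.
my $decoded = decode("ascii", "ok\x{E9}");

if ($decoded =~ /\x{FFFD}/) {
    warn "input contained bytes the encoding could not decode\n";
}
```

The same scan would catch a decode of Daud text containing curlies around an undefined sequence.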


      Thanks for those findings! There is one other potential case where there could be a single character between {}, but at present {i} is the only one I'm using.

      I've long had a module that handled this conversion, but it was a standalone thing, not integrated into the Encode system. A friend wanted to be able to systematically recode file and directory names between UTF-8 and Daud. He found 'convmv' which turns out to be a Perl program that uses Encode to do the heavy lifting. That motivated me to once again try to make Daud work in the Encode framework, leading here.

        I tried a bunch of other variations on the same sort of theme. I have only taken a quick glance at the C code created by enc2xs (and the differences in C code resulting from different variations of ucm), and I still don't understand exactly what's wrong with this process. But based on what I've seen, the symptom appears to be something like this:

        For any mapping that includes strings of 3 or more "non-unicode" bytes equating to unicode code points, the arrangement of definitions in the ucm file should have at least one of the 3-byte strings before any of the longer string definitions. That is:

        ...
        <Uhhhh> \xhh\xhh\xhh |0 # first mention of any multi-byte mapping
        ...
        <Uhhhh> \xhh\xhh\xhh\xhh |0
        ...
        The above arrangement will work, whereas the arrangement below will have the problem as described in the OP:
        ...
        <Uhhhh> \xhh\xhh\xhh\xhh |0 # first mention of any multi-byte mapping
        ...
        <Uhhhh> \xhh\xhh\xhh |0
        ...
        (updated to remove spurious spaces)

        Bear in mind that the ucm definitions don't need to be in unicode code-point order -- enc2xs doesn't care about the ordering (except when it comes to triggering this one strange little bug).

        It probably applies to any mapping involving 2-byte strings as well (that would seem logical), but I haven't tested that. In a nutshell, try ordering your definitions with respect to how long the encoded byte strings are, or at least put one instance of a 3-byte string mapping ahead of all instances of 4-byte mappings.

        (So it was just "coincidental" that I chose q-mark for the initial attempt in my first reply -- it only worked because it just happened to be conventionally placed above the accented letters.)