herveus has asked for the wisdom of the Perl Monks concerning the following question:


I have a specialized character encoding scheme that I am trying to get working with Encode. I'm so close, and yet not quite there. The encoding is known as "Daud" by its users.

The encoding scheme

The objective is to represent accented characters (primarily from Latin-1, but also from a number of other code pages) in ASCII in a lossless manner. Each non-ASCII character is encoded as a one- or two-character string within braces. The characters (two in nearly every case) have mnemonic value.

For example, 'LATIN CAPITAL LETTER A WITH CIRCUMFLEX', U+00C2, is encoded as {A^}. The encoding is typically the base letter plus a modifier. Just to be different, 'LATIN SMALL LETTER DOTLESS I' is encoded as {i}. Hold that thought.
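To make the scheme concrete, here is a minimal pure-Perl sketch of the mapping. It does not use Encode at all. The names %daud_for, %char_for, to_daud and from_daud are made up for illustration, and the table holds only the two entries mentioned above, not the full set.

```perl
use strict;
use warnings;

# Hypothetical two-entry subset of the Daud table (the real table is much larger).
my %daud_for = (
    "\x{00C2}" => '{A^}',    # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    "\x{0131}" => '{i}',     # LATIN SMALL LETTER DOTLESS I
);
my %char_for = reverse %daud_for;

# Unicode -> Daud: replace each non-ASCII character that has a table entry.
# Characters without an entry are passed through unchanged in this sketch.
sub to_daud {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $daud_for{$1} ? $daud_for{$1} : $1/ge;
    return $text;
}

# Daud -> Unicode: replace each brace group of one or two characters
# that has a table entry; unknown groups are left as they are.
sub from_daud {
    my ($text) = @_;
    $text =~ s/(\{[^{}]{1,2}\})/exists $char_for{$1} ? $char_for{$1} : $1/ge;
    return $text;
}

print to_daud("\x{00C2}ge"), "\n";    # prints {A^}ge
```

Note that this sketch sidesteps the question of literal braces in the input, which the real encoding has to deal with.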

The lone problem

{i} does not properly get translated into U+0131, although it goes the other way just fine.

What I did

I generated a .ucm file. I started with 8859-1 and hacked. The file looks like:

<code_set_name> "daud"
<mb_cur_min> 1
<mb_cur_max> 4
<subchar> \x3F
CHARMAP
...
#<U007B> \x7B |0 # LEFT CURLY BRACKET
<U007C> \x7C |0 # VERTICAL LINE
#<U007D> \x7D |0 # RIGHT CURLY BRACKET
...
<U00C2> \x7B\x41\x5E\x7D |0 # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
...
<U0131> \x7B\x69\x7D |0 # LATIN SMALL LETTER DOTLESS I
...
END CHARMAP

I found that I had to comment out at least the left curly bracket to make this work at all. That's fine, as it's not an independently valid character in text in this encoding.

I wrote a test file that exercises each and every character, converting the Unicode character into the Daud equivalent, and then doing a round-trip Unicode -> Daud -> Unicode. The test file passes 665/666. The only test case that is failing is the round trip for {i}.

my $string = decode("daud", encode("daud", $tests{$name}->{unicode}));
leaves $string empty.
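When a round trip like the one above silently comes back empty, it may help to run the two stages separately with a strict CHECK value, so that a failure in encode can be told apart from a failure in decode. The following is only a sketch: round_trip_problem is a made-up helper name, and the usage loop runs against iso-8859-1 because that encoding ships with Encode, whereas "daud" exists only once the compiled module is installed.

```perl
use strict;
use warnings;
use Encode qw(encode decode find_encoding);

# Returns a diagnostic string for the first stage that goes wrong,
# or undef when the character survives Unicode -> bytes -> Unicode.
sub round_trip_problem {
    my ($enc_name, $char) = @_;
    return "unknown encoding $enc_name" unless find_encoding($enc_name);

    # Die on unmappable input instead of substituting, and do not
    # modify the source string in place.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my $bytes = eval { encode($enc_name, $char, $check) };
    return "encode failed: $@" unless defined $bytes;

    my $back = eval { decode($enc_name, $bytes, $check) };
    return "decode failed: $@" unless defined $back;

    if ($back ne $char) {
        my $got = join ' ', map(sprintf('U+%04X', ord($_)), split(//, $back));
        return sprintf 'got [%s], expected U+%04X', $got, ord($char);
    }
    return;
}

# Usage with a stock encoding; substitute 'daud' once that module is built.
for my $char ("\x{00C2}", "\x{0131}") {
    my $problem = round_trip_problem('iso-8859-1', $char);
    printf "U+%04X: %s\n", ord($char), defined $problem ? $problem : 'ok';
}
```

Whether this would have pinpointed the {i} case is not something I can say; it only separates "encode produced nothing usable" from "decode did not recognise the bytes".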

What else have I observed?

I used enc2xs to convert the UCM file into Daud_t.c. My examination of that C file leaves me even more puzzled. I can see how the data structures there appear to correctly map {i} to a Unicode character.

Does anyone have any useful insights or dope-slaps for the obvious thing I'm missing? Is there other information I should have provided that I held back because I didn't want to make a total dump of the stuff?

Update 24 hours later

Thank you graff. Moving the dotless i line higher in the .ucm file did the trick. I was also able to keep the tests passing when I uncommented the RIGHT CURLY BRACKET, but the LEFT CURLY BRACKET needed to stay out of circulation. So far, at least.

Encode comes with some utilities, including a sort utility, but that did not reorder things into the working order, so that was a bust.

By the time I got to writing a SOPW, I had run out of creative ideas. Now to move forward with the companion encodings that do lossy conversions to Latin-1 and ASCII. And play around with some fallback conversions. Wheee!