I have a specialized character encoding scheme that I am trying to get working with Encode. I'm so close, and yet not quite there. The encoding is known as "Daud" by its users.

The encoding scheme

The objective is to represent accented characters (primarily from Latin-1, but also from a number of other code pages) in ASCII in a lossless manner. Each non-ASCII character is encoded as a one- or two-character string within braces. The characters (two in nearly every case) have mnemonic value.

For example, 'LATIN CAPITAL LETTER A WITH CIRCUMFLEX', U+00C2, is encoded as {A^}. The encoding is typically the base letter plus a modifier. Just to be different, 'LATIN SMALL LETTER DOTLESS I', U+0131, is encoded as {i}. Hold that thought.
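To make the scheme concrete, here is a minimal pure-Perl sketch of the mapping. It is only an illustration, not how Encode or the compiled .ucm table does the work. The helper names daud_encode and daud_decode are hypothetical, the table holds only the two mappings quoted above, and falling back to '?' for unknown characters is an assumption borrowed from the <subchar> setting.

```perl
use strict;
use warnings;

# Only the two mappings quoted in the post; the real table is much larger.
my %to_daud = (
    "\x{00C2}" => '{A^}',   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
    "\x{0131}" => '{i}',    # LATIN SMALL LETTER DOTLESS I
);
my %from_daud = reverse %to_daud;

# Hypothetical helper: replace each known non-ASCII character with its
# braced form; unknown non-ASCII characters become '?' (an assumption).
sub daud_encode {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/exists $to_daud{$1} ? $to_daud{$1} : '?'/ge;
    return $text;
}

# Hypothetical helper: replace each known braced sequence with its
# character; unknown braced sequences are left untouched.
sub daud_decode {
    my ($text) = @_;
    $text =~ s/(\{[^{}]{1,2}\})/exists $from_daud{$1} ? $from_daud{$1} : $1/ge;
    return $text;
}

print daud_encode("na\x{0131}ve"), "\n";   # prints na{i}ve
```

Note that the decoder has to accept one character between the braces as well as two, which is exactly the case that {i} exercises.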

The lone problem

{i} does not properly get translated into U+0131, although it goes the other way just fine.

What I did

I generated a .ucm file. I started with 8859-1 and hacked. The file looks like:

<code_set_name> "daud"
<mb_cur_min> 1
<mb_cur_max> 4
<subchar> \x3F
CHARMAP
...
#<U007B> \x7B |0 # LEFT CURLY BRACKET
<U007C> \x7C |0 # VERTICAL LINE
#<U007D> \x7D |0 # RIGHT CURLY BRACKET
...
<U00C2> \x7B\x41\x5E\x7D |0 # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
...
<U0131> \x7B\x69\x7D |0 # LATIN SMALL LETTER DOTLESS I
...
END CHARMAP

I found that I had to comment out at least the left curly bracket to make this work at all. That's fine, as it's not an independently valid character in text in this encoding.

I wrote a test file that exercises each and every character, converting the Unicode character into the Daud equivalent, and then doing a round-trip Unicode -> Daud -> Unicode. The test file passes 665/666. The only test case that is failing is the round trip for {i}.

my $string = decode("daud", encode("daud", $tests{$name}->{unicode}));
leaves $string empty.
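The round-trip check described above can be sketched as a small helper. This is a sketch under assumptions, not the actual test file: the name round_trips is hypothetical, and iso-8859-1 stands in for "daud" so the snippet runs without the compiled module. Passing FB_CROAK as the CHECK argument asks Encode to die on data it cannot map, which may give a more informative failure than an empty string.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if $char can be encoded to $encoding and
# decoded back to the identical character.
sub round_trips {
    my ($encoding, $char) = @_;

    # Die on unmappable data, and leave the source string unmodified.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();

    my $back = eval {
        my $bytes = Encode::encode($encoding, $char, $check);
        Encode::decode($encoding, $bytes, $check);
    };
    return defined $back && $back eq $char;
}

# iso-8859-1 is a stand-in; substitute "daud" once the module is built.
print round_trips('iso-8859-1', "\x{00C2}") ? "ok\n" : "not ok\n";
```

When a character fails, $@ holds the message from Encode right after the eval, which is worth printing in a real test.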

What else have I observed?

I used enc2xs to convert the UCM file into Daud_t.c. My examination of that C file leaves me even more puzzled. I can see how the data structures there appear to correctly map {i} to a Unicode character.

Does anyone have any useful insights or dope-slaps for the obvious thing I'm missing? Is there other information I should have provided that I held back because I didn't want to make a total dump of the stuff?

Update 24 hours later

Thank you graff. Moving the dotless i line higher in the .ucm file did the trick. I was also able to keep the tests passing when I uncommented the RIGHT CURLY BRACKET, but the LEFT CURLY BRACKET needed to stay out of circulation. So far, at least.

Encode comes with some utilities, including a sort utility, but that did not reorder things into the working order, so that was a bust.

By the time I got to writing a SOPW, I had run out of creative ideas. Now to move forward with the companion encodings that do lossy conversions to Latin-1 and ASCII. And play around with some fallback conversions. Wheee!


In reply to Encoding: my custom encoding fails on one character but works for everything else?! by herveus
