
I tried a bunch of other variations on the same sort of theme. I have only taken a quick glance at the C code created by enc2xs (and the differences in C code resulting from different variations of ucm), and I still don't understand exactly what's wrong with this process. But based on what I've seen, the symptom appears to be something like this:

If a mapping includes encoded ("non-unicode") byte strings of 3 or more bytes that map to unicode code points, the ucm file should list at least one of the 3-byte strings before any of the longer ones. That is:

...
<Uhhhh>  \xhh\xhh\xhh  |0   # first mention of any multi-byte mapping
...
<Uhhhh>  \xhh\xhh\xhh\xhh |0
...

The above arrangement will work, whereas the arrangement below will have the problem as described in the OP:

...
<Uhhhh>  \xhh\xhh\xhh\xhh |0  # first mention of any multi-byte mapping
...
<Uhhhh>  \xhh\xhh\xhh  |0
...

(updated to remove spurious spaces)

Bear in mind that the ucm definitions don't need to be in unicode code-point order -- enc2xs doesn't care about the ordering (except when it comes to triggering this one strange little bug).

It probably applies to mappings involving 2-byte strings as well (that would seem logical), but I haven't tested that. In a nutshell, try ordering your definitions by the length of the encoded byte strings, or at least put one 3-byte mapping ahead of all the 4-byte mappings.
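If reordering by hand is tedious, something like the following could do it. This is only a rough sketch of the workaround, not a fix for whatever enc2xs is actually doing wrong. The sub names (byte_length, reorder_ucm) are made up for illustration, and it assumes the mappings sit between the usual CHARMAP / END CHARMAP lines. Comment and blank lines inside the block count as length 0, so they float to the top of the block.

```perl
use strict;
use warnings;

# Count the \xHH escapes in the byte-sequence column of a ucm mapping line.
# Returns 0 for anything that is not a mapping (comments, blank lines).
sub byte_length {
    my ($line) = @_;
    return 0
        unless $line =~ /^\s*<U[0-9A-Fa-f]+>\s+((?:\\x[0-9A-Fa-f]{2})+)/;
    my $bytes = $1;
    my $count = () = $bytes =~ /\\x/g;
    return $count;
}

# Reorder the lines between CHARMAP and END CHARMAP so that shorter byte
# strings come first. Lines of equal length keep their original relative
# order; everything outside the block is passed through untouched.
sub reorder_ucm {
    my @lines = @_;
    my ( @out, @block );
    my $in_map = 0;
    for my $line (@lines) {
        if ( !$in_map && $line =~ /^CHARMAP\b/ ) {
            $in_map = 1;
            push @out, $line;
        }
        elsif ( $in_map && $line =~ /^END CHARMAP\b/ ) {
            # Decorate with [length, index] so ties fall back to input order.
            my @keyed  = map { [ byte_length( $block[$_] ), $_ ] } 0 .. $#block;
            my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @keyed;
            push @out, map { $block[ $_->[1] ] } @sorted;
            push @out, $line;
            @block  = ();
            $in_map = 0;
        }
        elsif ($in_map) {
            push @block, $line;
        }
        else {
            push @out, $line;
        }
    }
    push @out, @block;    # no END CHARMAP seen: leave those lines as they were
    return @out;
}

# Run as a filter only when file names are given on the command line.
print reorder_ucm(<>) if !caller && @ARGV;
```

Run it as, say, "perl reorder_ucm.pl my.ucm > my_sorted.ucm" (both file names are placeholders), then feed the sorted file to enc2xs as before. I have only reasoned about the 3-byte vs. 4-byte case described above, so check the resulting encoding against your own test strings.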

(So it was just "coincidental" that I chose q-mark for the initial attempt in my first reply -- it only worked because it just happened to be conventionally placed above the accented letters.)

In reply to Re^3: Encoding: my custom encoding fails on one character but works for everything else?! by graff
in thread Encoding: my custom encoding fails on one character but works for everything else?! by herveus
