PerlMonks  

Re^2: Encoding: my custom encoding fails on one character but works for everything else?!

by herveus (Parson)
on Sep 14, 2009 at 11:55 UTC (#795096)


in reply to Re: Encoding: my custom encoding fails on one character but works for everything else?!
in thread Encoding: my custom encoding fails on one character but works for everything else?!

Howdy!

Thanks for those findings! There is one other potential case where a single character could appear between {}, but at present {i} is the only one I'm using.

I've long had a module that handled this conversion, but it was a standalone thing, not integrated into the Encode system. A friend wanted to be able to systematically recode file and directory names between UTF-8 and Daud. He found 'convmv', which turns out to be a Perl program that uses Encode to do the heavy lifting. That motivated me to try once again to make Daud work in the Encode framework, which led here.

yours,
Michael

Re^3: Encoding: my custom encoding fails on one character but works for everything else?!
by graff (Chancellor) on Sep 14, 2009 at 14:08 UTC
    I tried a bunch of other variations on the same sort of theme. I have only taken a quick glance at the C code created by enc2xs (and the differences in C code resulting from different variations of ucm), and I still don't understand exactly what's wrong with this process. But based on what I've seen, the symptom appears to be something like this:

    For any mapping that includes strings of 3 or more "non-unicode" bytes equating to unicode code points, the arrangement of definitions in the ucm file should have at least one of the 3-byte strings before any of the longer string definitions. That is:

        ...
        <Uhhhh> \xhh\xhh\xhh |0      # first mention of any multi-byte mapping
        ...
        <Uhhhh> \xhh\xhh\xhh\xhh |0
        ...

    The above arrangement will work, whereas the arrangement below will have the problem as described in the OP:

        ...
        <Uhhhh> \xhh\xhh\xhh\xhh |0  # first mention of any multi-byte mapping
        ...
        <Uhhhh> \xhh\xhh\xhh |0
        ...

    (updated to remove spurious spaces)

    Bear in mind that the ucm definitions don't need to be in unicode code-point order -- enc2xs doesn't care about the ordering (except when it comes to triggering this one strange little bug).

    It probably applies to mappings involving 2-byte strings as well (that would seem logical), but I haven't tested that. In a nutshell, try ordering your definitions by the length of the encoded byte strings, or at least put one instance of a 3-byte string mapping ahead of all instances of 4-byte mappings.
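
    The reordering suggested above can be sketched as a small filter that stable-sorts the CHARMAP entries of a ucm file by the number of \xhh bytes on each line. This is only a minimal sketch of the workaround, not a confirmed fix for the underlying enc2xs behaviour. It is written in Python for brevity; the function names, the regular expression and the sample entries (private-use code points) are all hypothetical and are not taken from the real Daud map.

```python
import re

# One CHARMAP entry: <Uhhhh>, one or more \xhh bytes, then the |n fallback flag.
# (Assumed line layout, based on the entries quoted in this thread.)
ENTRY = re.compile(r"^\s*<U[0-9A-Fa-f]{4,6}>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|\d")


def encoded_length(line):
    """Return the number of encoded bytes on a mapping line, or None."""
    match = ENTRY.match(line)
    if match is None:
        return None
    return len(match.group(1)) // 4  # each byte is spelled with 4 characters


def reorder_charmap(ucm_text):
    """Stable-sort CHARMAP entries so that shorter byte strings come first."""
    out, entries, in_charmap = [], [], False
    for line in ucm_text.splitlines():
        stripped = line.strip()
        if stripped == "CHARMAP":
            in_charmap = True
            out.append(line)
        elif stripped == "END CHARMAP":
            # sorted() is stable: entries of equal length keep their order.
            out.extend(sorted(entries, key=encoded_length))
            entries, in_charmap = [], False
            out.append(line)
        elif in_charmap and encoded_length(line) is not None:
            entries.append(line)
        else:
            # Header lines, comments and blank lines are passed through;
            # inside CHARMAP they end up ahead of the sorted entries.
            out.append(line)
    # Tolerate a file that is missing its END CHARMAP line.
    out.extend(sorted(entries, key=encoded_length))
    return "\n".join(out) + "\n"


# Hypothetical sample: the 4-byte mapping is mentioned before the 3-byte one.
SAMPLE = r"""<code_set_name> "example"
CHARMAP
<U0041> \x41 |0
<UE001> \x7B\x61\x62\x7D |0
<UE002> \x7B\x69\x7D |0
END CHARMAP
"""

print(reorder_charmap(SAMPLE), end="")
```

    Sorting the whole CHARMAP is more than the symptom strictly requires (one 3-byte entry ahead of the 4-byte ones), but since enc2xs doesn't care about the ordering otherwise, a full sort by length is the simplest arrangement to keep stable as entries are added.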

    (So it was just "coincidental" that I chose q-mark for the initial attempt in my first reply -- it only worked because it just happened to be conventionally placed above the accented letters.)

      Howdy!

      Hmmm... Thank you for that insight. I'll look into that nuance tonight as well.

      yours,
      Michael
