PerlMonks  

(OT) How to deal with non-ascii names

by bluescreen (Friar)
on Aug 13, 2010 at 15:46 UTC (#854933)
bluescreen has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,

I work for an online travel agency, and our application is (or tries to be) UTF-8 compatible because we have to deal with names from all around the globe. The application talks to reservation systems, some of which accept UTF-8, while others (really old ones) accept only ASCII.

A quick solution would be either to wipe out non-ASCII characters (e.g. María becomes Mara) or to transform them to the closest ASCII character (e.g. María becomes Maria). That works OK for Spanish, German and some other occidental names. The main problem with either approach arises with oriental names (Korean, Japanese, Chinese, etc.), where there is no direct mapping from the characters to ASCII, and if you wipe out the non-ASCII characters you end up with an empty string.

That said, my questions are:

  • Did you have to deal with something like this? If so what was your approach?
  • Is there any Perl module that helps translating things into ASCII in a reliable way?

Thanks

Re: (OT) How to deal with non-ascii names
by ikegami (Pope) on Aug 13, 2010 at 16:07 UTC
    I faced a similar situation here, although we're primarily concerned with French names. I currently use
    use Text::Unidecode;
    sub clean { my $s = unidecode(shift); $s =~ s/[^A-Za-z0-9'-]+/ /g; return $s; }

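    Text::Unidecode's unidecode will strip accents, while the substitution replaces any leftover run of characters other than ASCII letters, digits, apostrophes and hyphens with a single space. unidecode does a lot more than strip accents, though (it also attempts to transliterate non-Latin scripts), so it might be a viable solution for you.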
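
    To make the difference between "wipe out" and "closest ASCII" concrete, here is a minimal sketch that uses only the core Unicode::Normalize module. It is not what Text::Unidecode does internally; it only decomposes accented letters and drops the combining marks. The function names strip_non_ascii and fold_accents are made up for illustration.

```perl
use strict;
use warnings;
use utf8;                         # this source contains UTF-8 string literals
use Unicode::Normalize qw(NFD);   # core module

# Approach 1: drop every non-ASCII character outright.
sub strip_non_ascii {
    my ($s) = @_;
    $s =~ s/[^\x00-\x7F]//g;
    return $s;
}

# Approach 2: decompose accented letters (NFD), drop the combining marks,
# then remove whatever is still outside ASCII.
sub fold_accents {
    my ($s) = @_;
    $s = NFD($s);
    $s =~ s/\p{Mn}//g;
    $s =~ s/[^\x00-\x7F]//g;
    return $s;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $name ('María', 'Jürgen', '山田') {
    printf "%s -> strip: '%s', fold: '%s'\n",
        $name, strip_non_ascii($name), fold_accents($name);
}
```

    Both functions return an empty string for the Japanese name, which is exactly the case the original question worries about.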

    The main problem with either approach arises with oriental names (Korean, Japanese, Chinese, etc.)

    Perhaps you should ask for the name as it appears on the passport. They are romanised.
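
    If you do collect the romanised name as its own field, a simple sanity check before sending it on might look like the sketch below. The function name is_romanised_name and the permitted character set (ASCII letters, spaces, apostrophes, hyphens) are assumptions for illustration; each reservation system will have its own rules.

```perl
use strict;
use warnings;
use utf8;   # this source contains a UTF-8 string literal

# Hypothetical check: the name must start and end with an ASCII letter and
# may contain ASCII letters, spaces, apostrophes and hyphens in between.
sub is_romanised_name {
    my ($name) = @_;
    return 0 unless defined $name;
    return $name =~ /\A[A-Za-z](?:[A-Za-z' -]*[A-Za-z])?\z/ ? 1 : 0;
}

binmode STDOUT, ':encoding(UTF-8)';
for my $name ('YAMADA TARO', "O'Brien", 'María', '山田') {
    printf "%s: %s\n", $name,
        is_romanised_name($name) ? 'ok' : 'ask the user for the passport spelling';
}
```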

Re: (OT) How to deal with non-ascii names
by JavaFan (Canon) on Aug 13, 2010 at 16:08 UTC
    I think you should contact the people managing the reservation systems and find out how non-ASCII names should be translated to ASCII for their system. You really do not want customers refused boarding a plane because the name on the reservation does not match what's on their passports.

    There are transliteration systems to map non-Western names to a Western alphabet, but they typically focus on one language, aren't unique (different newspapers write a Russian name differently), and change over time (China's capital went from Peking to Beijing).

    Note that transliteration can have interesting effects. Paris written in Chinese is 巴黎, which transliterated is BaLi. Now you don't want to send your Chinese customers who booked a romantic honeymoon in Paris to Indonesia.

Re: (OT) How to deal with non-ascii names
by jonadab (Parson) on Aug 13, 2010 at 17:34 UTC
    Is there any Perl module that helps translating things into ASCII in a reliable way?

    HTML::Entities. HTH.HAND.

    In all seriousness, I agree to a large extent with what the others have said. If you have to do this automatically (without getting a romanized version from the user), the transliteration method is going to need to be language-specific.

    For instance, for Japanese you might check out Lingua::JA::Hepburn::Passport. It doesn't appear to support kanji, but I'm not sure it's possible to automatically romanize kanji, since most of them have at least half a dozen different readings. The same character might romanize to "mei" in one name, "myo" or "myou" in another name, "min" in another, "a" in another, "aka" in another, "aki" in another (this is a real example). If you can't get furigana (pronunciation guide characters, usually kana) from the user, names are going to get romanized very incorrectly.

Re: (OT) How to deal with non-ascii names
by oko1 (Deacon) on Aug 13, 2010 at 18:52 UTC
    This application talks to reservation systems, some of them accept utf8 but some others -really old- only accept ASCII.

    I think that, rather than considering how to turn UTF8 into ASCII, the real problem you're facing is how to send your data to those "really old" systems in ways that they would find acceptable and readable. Some years ago, I did some work for a medical data processing company that had a very similar problem/question that they were trying to solve. The "solution" they had been trying to implement was a single "unified" format for everything (they'd spent a couple of years on this already, lost a number of customers, and weren't much further along than where they started.) My answer, though it sounded inefficient, was to simply forward all the "acceptably-formatted" data to the companies that could use it, and write custom converters for each of the rest. We were done in just under two months, and handled every single format (yes, it was done in Perl. :)

    In short, I suggest that you send the UTF8 data to the companies that are happy with it, then contact the ones that only accept ASCII and get the precise definition of how they want those issues handled (obviously, they have some way of doing so - and more importantly, have already decided how those issues are to be handled in their case, meaning that you don't have to reinvent that wheel.) Write the necessary converters for those cases, and send that data to those companies. Done deal.
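
    The "one converter per system" idea could be sketched as a dispatch table like the one below. The system names (modern_gds, legacy_gds) and the accent-folding rule are placeholders, not anything a real reservation system prescribes; the real rules would come from each vendor.

```perl
use strict;
use warnings;
use utf8;                         # this source contains UTF-8 string literals
use Unicode::Normalize qw(NFD);   # core module

# Placeholder converter for an ASCII-only system: fold accents, drop
# whatever is still non-ASCII, and upper-case the result.
sub to_legacy_ascii {
    my ($name) = @_;
    my $s = NFD($name);
    $s =~ s/\p{Mn}//g;
    $s =~ s/[^\x00-\x7F]//g;
    return uc $s;
}

# One entry per reservation system (names are hypothetical).
my %converter_for = (
    modern_gds => sub { return $_[0] },   # accepts UTF-8: pass through
    legacy_gds => \&to_legacy_ascii,      # ASCII only: custom converter
);

sub name_for_system {
    my ($system, $name) = @_;
    my $convert = $converter_for{$system}
        or die "No converter registered for system '$system'\n";
    my $out = $convert->($name);
    # Refuse to send an empty name rather than silently losing the customer.
    die "Converter for '$system' produced an empty name\n" if $out eq '';
    return $out;
}

binmode STDOUT, ':encoding(UTF-8)';
print name_for_system($_, 'María'), "\n" for qw(modern_gds legacy_gds);
```

    Failing loudly on an empty result means a name the converter cannot handle gets flagged for a human instead of reaching the reservation system blank.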


    --
    "Language shapes the way we think, and determines what we can think about."
    -- B. L. Whorf
