PerlMonks  

How to convert garbled characters into their real value

by Anonymous Monk
on Nov 26, 2012 at 14:59 UTC ( #1005683=perlquestion )
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

My program is like this:
use utf8; use Text::Unidecode; print unidecode("ต้มยำกุ้ง");

But the output is the same: ต้มยำกุ้ง. This ต้มยำกุ้ง is a garbled form of the Thai word for Tom Yum with Prawns.

Can anybody tell me the exact way to do this?

Re: How to convert garbled characters into their real value
by Anonymous Monk on Nov 26, 2012 at 15:05 UTC

    Can you use Data::Dump and dump your utf string, so we can get its real bytes?
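
    A minimal sketch of the kind of dump being asked for, using core Perl only. The helper name codepoints is made up for this example, and Data::Dump should show much the same information in its own notation. The sample string is assumed to be the first three garbled characters as Perl would see them.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): list every character of a
# string as U+XXXX so that look-alike or invisible characters stand out.
sub codepoints {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Assumed sample: a-grave, cedilla, bullet
print codepoints("\x{e0}\x{b8}\x{2022}"), "\n";   # U+00E0 U+00B8 U+2022
```

    If the dump shows Latin code points like these rather than values in the U+0E00 range, the string already holds the mis-decoded characters and not the Thai ones.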

Re: How to convert garbled characters into their real value
by ColonelPanic (Friar) on Nov 26, 2012 at 15:29 UTC

    More context would be helpful. Where is your input string coming from, and what do you mean by "real value"?

    Text::Unidecode only aims to provide a rough ASCII transliteration of the underlying characters, and it is self-admittedly quite bad at Thai. Is this really what you want?

    Update: also, if you don't know about it already, be sure to check out Encode, the core Perl module that deals with character encodings.



    When's the last time you used duct tape on a duct? --Larry Wall
Re: How to convert garbled characters into their real value
by Khen1950fx (Canon) on Nov 26, 2012 at 15:58 UTC
    Is the Thai correct? I tried it and got "tmyamkung".
    #!/usr/bin/perl -l
    use utf8;
    use strict;
    use warnings;
    use Text::Unidecode;
    print unidecode("ต้มยำกุ้ง");
    print unidecode( "\x{e15}\x{e49}\x{e21}\x{e22}\x{e33}\x{e01}\x{e38}\x{e49}\x{e07}" );
    Update: ignore the first print.
      use utf8;
      use strict;
      use warnings;
      use Text::Unidecode;
      
      print unidecode("ต้มยำกุ้ง"), $/;
      

      When necessary, there are still <pre/> tags for this stuff. If it's short, there's no real problem (well, no download link but…).

        The unstated underlying problem here is that someone started with the string "ต้มยำกุ้ง" (as UTF-8 bytes) but then incorrectly decoded it (probably as Windows cp1258), producing the new string "ต้มยำกุ้ง".

        Note that in UTF-8, each of the nine characters in the Thai string takes three bytes. Therefore the Latin 1 decoding includes 9 x 3 = 27 characters, and each triplet begins with à. It is often the case that a UTF-8 string incorrectly decoded as a Latin 1 character set will show each original character as beginning with some accented form of the letter a or A.
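
        The byte arithmetic can be checked with a short sketch. The helper name utf8_triplets is hypothetical, and the code points are the ones given for the Thai string earlier in the thread.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encode a string and return one hex string per
# 3-byte group. Only meaningful when every character needs exactly three
# bytes, as the Thai characters here do.
sub utf8_triplets {
    my ($text) = @_;
    return unpack '(H6)*', encode('UTF-8', $text);
}

my @triplets = utf8_triplets(
    "\x{e15}\x{e49}\x{e21}\x{e22}\x{e33}\x{e01}\x{e38}\x{e49}\x{e07}"
);
print scalar(@triplets), " triplets: @triplets\n";
# 9 triplets: e0b895 e0b989 e0b8a1 e0b8a2 e0b8b3 e0b881 e0b8b8 e0b989 e0b887
```

        Every group starts with byte E0, which is à in Latin 1 and in the Windows code pages, so that is the character that shows up at the head of each garbled triplet.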

        I was not able to repair the string in place, but writing it to a file, I can use Perl's IO Layers and the Encode module to repair the encoding.

        use strict;
        use Encode;
        $|++;
        my $t = 'thai.txt'; # contains => ต้มยำกุ้ง
        open my $fh, '<:raw', $t or die "Couldn't open $t: $!";
        my $content = do { local $/; <$fh> };
        close $fh;
        $content = decode('UTF-8', $content);
        binmode *STDOUT, ':encoding(UTF-8)';
        print "$content\n";

        However, note that rather than writing this program you can use Perl's wonderful character encoder/decoder without writing any code:

        piconv -t UTF-8 thai.txt > thai_fixed.txt
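
        For completeness, an in-memory repair might look something like the sketch below. It is not the method used above: the helper name fix_mojibake is made up, and it assumes the wrong decoding was Windows cp1252 (swap in another encoding name if that guess is off). One possible stumbling block is that byte 0x81 has no cp1252 mapping, so it is passed through by hand here.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper, not from the thread: undo a "UTF-8 bytes read as
# cp1252" mix-up. The cp1252 choice is an assumption about the bad decode.
sub fix_mojibake {
    my ($garbled) = @_;

    my $bytes = join '', map {
        my $ch = $_;
        # Code points below 0x100 (including stray C1 controls such as
        # U+0081, which cp1252 leaves undefined) are taken as the byte
        # itself; everything else is mapped back through cp1252.
        ord($ch) < 0x100
            ? pack('C', ord $ch)
            : encode('cp1252', $ch, Encode::FB_CROAK);
    } split //, $garbled;

    # Croaks if the recovered bytes are not valid UTF-8
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

# "\x{e0}\x{b8}\x{2022}" is how U+0E15 looks after the bad decode
my $fixed = fix_mojibake("\x{e0}\x{b8}\x{2022}");
printf "U+%04X\n", ord $fixed;   # U+0E15
```

        FB_CROAK makes the sketch die on anything it cannot map instead of silently substituting a question mark, which seems the safer choice when the input may not be garbled in the way assumed.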
Re: How to convert garbled characters into their real value
by Anonymous Monk on Nov 26, 2012 at 23:14 UTC

    When you know that you are dealing with international character sets (e.g. Unicode) then always remember that you are looking "through a glass, darkly."

    "The output" "is garbled," it would seem, because it is being displayed to you in the wrong character set: an English character set, not Thai. But the bytes might well not be "garbled" or "incorrect" at all! In fact they probably are not.

    If you send a string to the unidecode() function, do you know that the bytes are the same? Most likely they are not ... you're sending it a quoted character-string which means that some decoding is going to happen to produce "the bytes."

    You have got to get straight to the computer's level ... to a stream of bytes.
