How to convert grabled characters into their real value

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to convert grabled characters into their real value by Khen1950fx (Canon) on Nov 26, 2012 at 15:58 UTC
Is the Thai correct? I tried it and got "tmyamkung".

```perl
#!/usr/bin/perl -l
use utf8;
use strict;
use warnings;
use Text::Unidecode;

print unidecode("ต้มยำกุ&#3657;ง");
print unidecode(
    "\x{e15}\x{e49}\x{e21}\x{e22}\x{e33}\x{e01}\x{e38}\x{e49}\x{e07}"
);
```

Update: ignore the first print.
Re^2: How to convert grabled characters into their real value by Your Mother (Archbishop) on Nov 26, 2012 at 18:18 UTC
```perl
use utf8;
use strict;
use warnings;
use Text::Unidecode;

print unidecode("ต้มยำกุ้ง"), $/;
```

When necessary, there are still `<pre/>` tags for this stuff. If it's short, there's no real problem (well, no download link, but…).
Re^3: How to convert grabled characters into their real value by rcrews (Novice) on Nov 26, 2012 at 19:23 UTC
The unstated underlying problem here is that someone started with the string "ต้มยำกุ้ง" but then decoded its UTF-8 bytes with the wrong encoding (probably Windows cp1258), producing a garbled string in which every Thai character shows up as three unrelated characters. Note that in UTF-8, each of the nine characters in the Thai string takes three bytes. Therefore the Latin 1 decoding contains 9 × 3 = 27 characters, and each triplet begins with à (byte 0xE0). A UTF-8 string that has been incorrectly decoded as Latin 1 will often show each original character as beginning with some accented form of the letter a or A.

I was not able to repair the string in place, but after writing it to a file I can use Perl's I/O layers and the Encode module to repair the encoding.

```perl
use strict;
use Encode;

$|++;    # unbuffered output

my $t = 'thai.txt';    # contains => ต้มยำกุ้ง

# Read the file as raw bytes, with no decoding layer.
open my $fh, '<:raw', $t or die "Couldn't open $t: $!";
my $content = do { local $/; <$fh> };
close $fh;

# Interpret the bytes as UTF-8, then encode to UTF-8 again on output.
$content = decode('UTF-8', $content);
binmode *STDOUT, ':encoding(UTF-8)';
print "$content\n";
```

However, note that rather than writing this program you can use piconv, Perl's wonderful character encoder/decoder, without writing any code:

```shell
piconv -t UTF-8 thai.txt > thai_fixed.txt
```
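For readers who want to see the mechanism described above, here is a minimal sketch of how the garbling arises and how it might be undone without going through a file. It assumes the wrong encoding was ISO-8859-1, where every byte maps to exactly one character; the thread only guesses at the actual encoding, and with a Windows code page some bytes may have no mapping, in which case the round trip could lose data. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The Thai string from this thread, written as code points so the
# example does not depend on the encoding of this source file.
my $thai = "\x{e15}\x{e49}\x{e21}\x{e22}\x{e33}\x{e01}\x{e38}\x{e49}\x{e07}";

# Reproduce the damage: the UTF-8 bytes are misread as ISO-8859-1
# (an assumption), so every byte becomes one character.
my $bytes   = encode('UTF-8', $thai);          # 9 characters -> 27 bytes
my $garbled = decode('ISO-8859-1', $bytes);    # 27 bytes -> 27 characters

# Undo it: turn the characters back into the same bytes, then decode
# those bytes as what they really are.
my $repaired = decode('UTF-8', encode('ISO-8859-1', $garbled));

binmode STDOUT, ':encoding(UTF-8)';
printf "garbled: %d characters, first is U+%04X\n",
    length($garbled), ord($garbled);
print "repaired matches the original\n" if $repaired eq $thai;
```

The first garbled character is U+00E0 (à), which matches the observation that each triplet starts with an accented a.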
Re: How to convert grabled characters into their real value by ColonelPanic (Friar) on Nov 26, 2012 at 15:29 UTC
More context would be helpful. Where is your input string coming from, and what do you mean by "real value"? Text::Unidecode only aims to provide a rough ASCII transliteration of the underlying characters, and it is self-admittedly quite bad at Thai. Is this really what you want?

Update: also, if you don't know about it already, be sure to check out Encode, the core Perl module that deals with character encodings.

When's the last time you used duct tape on a duct? --Larry Wall
Re: How to convert grabled characters into their real value by Anonymous Monk on Nov 26, 2012 at 15:05 UTC
Can you use Data::Dump to dump your UTF-8 string, so we can see its real bytes?
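Data::Dump (through its `dd` or `pp` functions) is one way to get that view. As a rough sketch using only core modules, the same information can be produced by hand: the code points of the character string and the bytes of its UTF-8 encoding. The helper names `code_points` and `utf8_hex_bytes` are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# Hypothetical helper: list the bytes of the string's UTF-8 encoding.
sub utf8_hex_bytes {
    my ($str) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, encode('UTF-8', $str);
}

# First three characters of the Thai word from this thread.
my $str = "\x{e15}\x{e49}\x{e21}";

print code_points($str), "\n";       # U+0E15 U+0E49 U+0E21
print utf8_hex_bytes($str), "\n";    # E0 B8 95 E0 B9 89 E0 B8 A1
```

If a string that should hold Thai text dumps as a run of code points below U+0100, it was most likely read as bytes and never decoded.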
Re: How to convert grabled characters into their real value by Anonymous Monk on Nov 26, 2012 at 23:14 UTC
When you know that you are dealing with international character sets (e.g. Unicode), always remember that you are looking "through a glass, darkly." The output seems garbled because it is being displayed to you in the wrong character set: an English character set, not Thai. But the bytes might well not be garbled or incorrect at all; in fact, they probably are not. If you send a string to the `unidecode()` function, do you know that the bytes are the same? Most likely they are not: you are sending it a quoted character string, which means that some decoding is going to happen to produce the bytes. You have to get straight down to the computer's level: a stream of bytes.
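One way to see the difference between a quoted string and its bytes is to read the same literal with and without the `utf8` pragma. This is only a sketch, and it assumes the source file itself is saved as UTF-8; the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Same literal, read two ways (assumes this file is saved as UTF-8).
my $chars = do { use utf8; "ต้มยำกุ้ง" };    # decoded into characters
my $bytes = do { no utf8;  "ต้มยำกุ้ง" };    # left as raw bytes

printf "with use utf8:    length %d\n", length($chars);
printf "without use utf8: length %d\n", length($bytes);
```

Under that assumption the first length is 9 and the second is 27, so a function such as `unidecode()` receives quite different input depending on which form it is given.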

