PerlMonks  

Re: How to convert grabled characters into their real value

by Khen1950fx (Canon)
on Nov 26, 2012 at 15:58 UTC


in reply to How to convert grabled characters into their real value

Is the Thai correct? I tried it and got "tmyamkung".

#!/usr/bin/perl -l
use utf8;
use strict;
use warnings;
use Text::Unidecode;

print unidecode("ต้มยำกุ้ง");
print unidecode(
    "\x{e15}\x{e49}\x{e21}\x{e22}\x{e33}\x{e01}\x{e38}\x{e49}\x{e07}"
);
Update: ignore the first print.


Re^2: How to convert grabled characters into their real value
by Your Mother (Canon) on Nov 26, 2012 at 18:18 UTC
    use utf8;
    use strict;
    use warnings;
    use Text::Unidecode;
    
    print unidecode("ต้มยำกุ้ง"), $/;
    

    When necessary, there are still <pre/> tags for this stuff. If it's short, there's no real problem (well, no download link but…).

      The unstated underlying problem is that someone started with the UTF-8 string "ต้มยำกุ้ง" and then decoded its bytes with the wrong encoding (probably Windows cp1258), producing the garbled string shown in the original question.

      Note that in UTF-8, each of the nine characters in the Thai string takes three bytes. The Latin-1 decoding therefore contains 9 x 3 = 27 characters, and each triplet begins with à (byte 0xE0). It is often the case that a UTF-8 string incorrectly decoded as Latin-1 shows each original character beginning with some accented form of the letter a or A.
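      The byte arithmetic above can be checked with a short script. This is only a sketch: it assumes the mix-up was a plain ISO-8859-1 (Latin-1) decode rather than cp1258, and the variable names are illustrative, not taken from the thread.

```perl
use strict;
use warnings;
use utf8;                                   # this source file is UTF-8
use Encode qw(encode decode);

my $thai    = "ต้มยำกุ้ง";                    # 9 code points
my $bytes   = encode('UTF-8', $thai);       # the same text as UTF-8 bytes
my $garbled = decode('ISO-8859-1', $bytes); # assumption: mis-decoded as Latin-1

# Split the mis-decoded string into groups of three characters,
# one group per original Thai character.
my @triplets = $garbled =~ /(.{3})/gs;

# Count the groups whose first character is U+00E0 (LATIN SMALL LETTER A WITH GRAVE).
my $lead_e0 = grep { substr($_, 0, 1) eq "\x{E0}" } @triplets;

binmode STDOUT, ':encoding(UTF-8)';
printf "%d code points, %d bytes, %d mis-decoded characters\n",
    length($thai), length($bytes), length($garbled);
printf "triplets starting with U+00E0: %d of %d\n",
    $lead_e0, scalar @triplets;
```

      Every Thai code point lies in U+0E00..U+0E7F, so its UTF-8 form always starts with byte 0xE0, which is why the lead character is the same in every triplet.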

      I was not able to repair the string in place, but writing it to a file, I can use Perl's IO Layers and the Encode module to repair the encoding.

      use strict;
      use Encode;
      $|++;

      my $t = 'thai.txt';    # contains => ต้มยำกุ้ง
      open my $fh, '<:raw', $t or die "Couldn't open $t: $!";
      my $content = do { local $/; <$fh> };
      close $fh;

      $content = decode('UTF-8', $content);
      binmode *STDOUT, ':encoding(UTF-8)';
      print "$content\n";

      However, note that rather than writing this program, you can use piconv, the character encoder/decoder that ships with Perl's Encode module, without writing any code:

      piconv -t UTF-8 thai.txt > thai_fixed.txt
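      For the in-memory case mentioned earlier, one possible approach is to turn each garbled character back into its byte and then decode those bytes as UTF-8. This is a hedged sketch, not something from the thread: repair_latin1_mojibake is a hypothetical helper name, and it assumes the damage was a plain ISO-8859-1 decode. If cp1252 or cp1258 was involved instead, some byte values may have no mapping there, so the round trip is not guaranteed.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode decode);

# Hypothetical helper (not from the thread): undo a "UTF-8 read as Latin-1"
# mix-up. Each character is mapped back to its single byte, and the bytes
# are then decoded as UTF-8. Dies if either step hits invalid data.
sub repair_latin1_mojibake {
    my ($garbled) = @_;
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    my $bytes = encode('ISO-8859-1', $garbled, $check);
    return decode('UTF-8', $bytes, $check);
}

# Build a garbled sample the same way the accident is assumed to have happened.
my $garbled  = decode('ISO-8859-1', encode('UTF-8', "ต้มยำกุ้ง"));
my $repaired = repair_latin1_mojibake($garbled);

binmode STDOUT, ':encoding(UTF-8)';
print "$repaired\n";
```

      The croak-on-error flag is a deliberate choice here: if the input was not really Latin-1 mojibake, failing loudly seems safer than silently emitting replacement characters.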
