PerlMonks  

Unicode encoding

by Anonymous Monk
on May 14, 2009 at 07:42 UTC (#764018=perlquestion)

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I am writing a script to parse a text file containing characters like "\u00E1", "\u015F" etc.
Can you please suggest a way to (de|en)code these characters so that the output text contains readable characters?
Thanks

Replies are listed 'Best First'.
Re: Unicode encoding
by moritz (Cardinal) on May 14, 2009 at 07:47 UTC
    my $str = '\u00E1\u015F';
    $str =~ s/\\u([0-9a-fA-F]{4,6})/chr(hex($1))/eg;
    open my $handle, '>:encoding(UTF-8)', $file
        or die "Can't open file `$file' for writing: $!";
    print $handle $str;
    close $handle or warn $!;

    See also Character encodings and Unicode in Perl.
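To check what the substitution plus the UTF-8 layer actually produce, the following minimal sketch decodes the two escapes from the question and dumps the encoded bytes in hex. It encodes in memory with the core Encode module instead of writing to `$file`, and it uses the four-digit form of the pattern; the variable names `$bytes` and `$hex` are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The literal text from the question: backslash-u escapes, not real characters yet.
my $str = '\u00E1\u015F';

# Turn each \uXXXX escape into the character with that code point.
$str =~ s/\\u([0-9a-fA-F]{4})/chr(hex($1))/eg;

# Encode the character string to UTF-8 bytes, as the :encoding(UTF-8) layer would.
my $bytes = encode('UTF-8', $str);

# Show each byte as two hex digits.
my $hex = join ' ', map { sprintf '%02X', ord } split //, $bytes;
print "$hex\n";
```

Each of the two characters ends up as a two-byte UTF-8 sequence, which is what a UTF-8 aware editor or terminal then renders as á and ş.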

      $str =~ s/\\u([0-9a-fA-F]{4,6})/chr(hex($1))/eg;

      These look like Java Unicode escapes, which always have exactly 4 digits as far as I've seen. \u20AC80 would mean €80, but your example would not see it that way :)

      s/\\u([0-9A-Fa-f]{4})/chr hex $1/ge;
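A small sketch of the difference between the two patterns, run on the `\u20AC80` input mentioned above. The helper names `decode_greedy` and `decode_fixed` are made up for this illustration; the regexes are the two quoted in this thread.

```perl
use strict;
use warnings;

# First suggestion: accept 4 to 6 hex digits after \u.
sub decode_greedy {
    my ($s) = @_;
    $s =~ s/\\u([0-9a-fA-F]{4,6})/chr(hex($1))/eg;
    return $s;
}

# Java-style: accept exactly 4 hex digits after \u.
sub decode_fixed {
    my ($s) = @_;
    $s =~ s/\\u([0-9A-Fa-f]{4})/chr hex $1/ge;
    return $s;
}

my $input  = '\u20AC80';
my $greedy = decode_greedy($input);
my $fixed  = decode_fixed($input);

# Report how many characters each version produced and the first code point.
printf "greedy: %d char(s), first code point U+%04X\n", length($greedy), ord($greedy);
printf "fixed:  %d char(s), first code point U+%04X\n", length($fixed),  ord($fixed);
```

The 4-to-6 digit version swallows the trailing `80` and yields a single code point 0x20AC80, which lies outside the Unicode range, while the fixed-width version yields the euro sign followed by the literal digits.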

      or die "Can't open file `$file' for writing: $!";

      Backticks and apostrophes as balanced quotes are ugly, and not at all balanced, in just about any font.

Re: Unicode encoding
by cdarke (Prior) on May 14, 2009 at 08:23 UTC
    Also bear in mind that what counts as readable depends on how you are displaying the characters. For example, Microsoft's cmd.exe has very poor non-English character support, and if you are using an xterm emulator then you might have to alter the character set supported.
      Actually, you can change the cmd.exe code page to UTF-8 (on Windows 2000 and later):
      system(chcp => 65001, '2>nul', '1>nul')
      But you'll also need a font to display the chars :) Lucida Console can't display everything AFAIK. However, it'll be enough for your local code page. It's good enough for Turkish for example.

        But you'll also need a font to display the chars :) Lucida Console can't display everything AFAIK.

        No font contains every glyph that Unicode supports. That's why good software supports falling back to other fonts; cmd.exe is not good software in this regard :)

        Your system call is wrong. You're telling Perl to pass '2>nul' and '1>nul' as parameters to chcp while they only make sense in a command passed to the shell. The call should be

        system('chcp 65001 2>nul 1>nul')

        which is short for

        system(cmd => ( '/c' => 'chcp 65001 2>nul 1>nul' ) )

        If you want to avoid calling the shell, you need the following:

        use IPC::Open3;

        open(my $fh, '>', 'nul') or die "open nul: $!\n";
        my $pid = open3(
            undef,               # Use parent's STDIN
            '>&'.fileno($fh),    # STDOUT = nul
            undef,               # STDERR = STDOUT
            chcp => 65001,
        );
        waitpid($pid, 0);

        Now, system is buggy on Windows, so your code might actually function as you intend it to. But if it does, you're relying on a bug in Perl.
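The point about list arguments can be seen without chcp at all. The sketch below assumes a Unix-like system, where a list-form call really does bypass the shell, and uses a list-form pipe open (which passes arguments the same way list-form system does) so the child's output can be captured. The helper name `child_args` and the one-line child program are invented for this illustration.

```perl
use strict;
use warnings;

# Run a tiny Perl child without a shell and return what it saw in @ARGV.
# With a list, every element reaches the child as a literal argument.
sub child_args {
    my @extra = @_;
    open(my $ph, '-|', $^X, '-e', 'print join "|", @ARGV', @extra)
        or die "Can't run $^X: $!";
    my $out = do { local $/; <$ph> };
    close $ph or warn "child exited with status $?";
    return $out;
}

# The same arguments that were passed to chcp in the list-form call above.
my $seen = child_args('65001', '2>nul', '1>nul');
print "child saw: $seen\n";
```

The child receives `2>nul` and `1>nul` as ordinary strings, so nothing is redirected. Redirections are only interpreted when a single command string is handed to the shell.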
