PerlMonks  

Re: Re: Unicode and locales

by moxliukas (Curate)
on Nov 12, 2002 at 09:35 UTC (#212231)


in reply to Re: Unicode and locales
in thread Unicode and locales

Unfortunately, Encode needs perl 5.7.3 or later (at least that's what perl -MCPAN -e 'install Encode' told me).

I will try Text::Iconv and see if that works.

Oh, and I have found the module Unicode::Map8. After reading the docs I am still not sure whether it is relevant to what I am doing. Can anyone enlighten me?

I guess I'll have to upgrade to 5.8 on the FreeBSD machine. It is probably high time to do it anyway ;)
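Until the upgrade happens, the choice of module could be made at run time. A minimal sketch, assuming only that Encode or Text::Iconv may or may not be installed; the helper name make_utf8_to_latin1 is made up for illustration:

```perl
#!/usr/bin/perl -w
use strict;

# Hypothetical helper: returns a code ref that converts UTF-8 bytes to
# latin1 bytes, using whichever module can be loaded on this perl.
sub make_utf8_to_latin1 {
    if ( eval { require Encode; 1 } ) {
        # Encode is bundled with perl 5.8 and later
        return sub {
            my $bytes = shift;
            return Encode::encode( 'iso-8859-1',
                Encode::decode( 'UTF-8', $bytes ) );
        };
    }
    if ( eval { require Text::Iconv; 1 } ) {
        # Text::Iconv wraps the system iconv library
        my $cnv = Text::Iconv->new( 'UTF-8', 'ISO-8859-1' );
        return sub { return $cnv->convert( $_[0] ) };
    }
    die "Neither Encode nor Text::Iconv is available\n";
}

my $to_latin1 = make_utf8_to_latin1();
my $latin1    = $to_latin1->("caf\xc3\xa9");    # UTF-8 bytes for "café"

# show the resulting bytes in hex
print join( ' ', map { sprintf '%02X', ord($_) } split //, $latin1 ), "\n";
```

The calling code only sees a code ref, so it does not need to know which module ended up doing the work.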


Re: Re: Re: Unicode and locales
by mirod (Canon) on Nov 12, 2002 at 11:13 UTC

    Unicode::Map8 (you need Unicode::String too) also does conversions, and it doesn't rely on iconv. This means it is probably more portable, but likely slower, than Text::Iconv. I usually use Text::Iconv.

    You might find "converting character encodings" useful: it shows various methods to convert UTF-8 characters to latin1.

    Here is a version that does not use XML::Parser (adapting it to other encodings is left as a(n easy) exercise for the reader ;--):

        #!/bin/perl -w

        # converts XML data from UTF-8 back into latin1
        #   -r uses a regexp
        #   -u uses Unicode::String (with Unicode::Map8)
        #   -i uses Text::Iconv (and the iconv library)
        # Note: -r does not work properly with XML::Parser 2.30

        use strict;

        # avoid an "uninitialized value" warning when no option is given
        my $opt = defined $ARGV[0] ? $ARGV[0] : '';

        my $filter;
        if    ( $opt eq '-r' ) { $filter = \&latin1; }
        elsif ( $opt eq '-u' ) { $filter = unicode_convert('latin1'); }
        elsif ( $opt eq '-i' ) { $filter = iconv_convert('latin1'); }
        else                   { die "usage: $0 [-r|-u|-i]"; }

        my $text = <DATA>;
        chomp $text;

        print "$text => ", $filter->($text), "\n";

        # shamelessly lifted from XML::TyePYX
        # rebuilds one latin1 byte from each two-byte UTF-8 sequence
        sub latin1 {
            my $text = shift;
            $text =~ s{([\xc0-\xc3])(.)}{
                my $hi = ord($1);
                my $lo = ord($2);
                chr( ( ( $hi & 0x03 ) << 6 ) | ( $lo & 0x3F ) )
            }ge;
            return $text;
        }

        # returns a closure that converts through Unicode::Map8
        sub unicode_convert {
            my $enc = shift;
            require Unicode::Map8;
            require Unicode::String;
            import Unicode::String qw(utf8);
            my $sub = eval q{
                {
                    my $cnv;    # converter is created once, on first call
                    sub {
                        $cnv ||= new Unicode::Map8 ($enc)
                            or die "Can't create converter";
                        return $cnv->to8( utf8( $_[0] )->ucs2 );
                    }
                }
            };
            return $sub;
        }

        # returns a closure that converts through Text::Iconv
        sub iconv_convert {
            my $enc = shift;
            require Text::Iconv;
            my $sub = eval q{
                {
                    my $cnv;    # converter is created once, on first call
                    sub {
                        $cnv ||= new Text::Iconv( 'utf8', $enc )
                            or die "Can't create converter";
                        return $cnv->convert( $_[0] );
                    }
                }
            };
            return $sub;
        }

        __DATA__
        texte soupçonné d'être plein de caractères accentués
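The regexp in latin1 relies on how UTF-8 encodes code points U+0080 to U+00FF: a lead byte of the form 110000xx followed by a continuation byte 10yyyyyy. A small worked sketch of the same arithmetic for a single character (the sub name decode_pair is made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical helper: rebuild one latin1 byte from a two-byte UTF-8 sequence
sub decode_pair {
    my ( $hi, $lo ) = map { ord($_) } @_;

    # the low 2 bits of the lead byte become bits 7-6 of the result,
    # the low 6 bits of the continuation byte become bits 5-0
    return chr( ( ( $hi & 0x03 ) << 6 ) | ( $lo & 0x3F ) );
}

# "é" is U+00E9, encoded in UTF-8 as the bytes 0xC3 0xA9
my $byte = decode_pair( "\xc3", "\xa9" );
printf "0x%02X\n", ord $byte;    # prints 0xE9
```

In valid UTF-8 only the lead bytes 0xC2 and 0xC3 occur for this range; 0xC0 and 0xC1 would be overlong encodings, so the regexp's character class is slightly wider than strictly needed. Anything above U+00FF has no latin1 equivalent and passes through the regexp untouched.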

      I think it's time for a benchmark here:

      Using perl 5.8.0, on Linux (Mandrake 9.0) on a rather fast machine (Athlon dual-processor 1.8):

          #!/bin/perl -w
          use strict;

          use Benchmark qw(cmpthese);

          use Encode;
          use Text::Iconv;
          use Unicode::Map8;
          use Unicode::String qw(utf8);

          use utf8;

          my $enc = 'latin1';

          my $convert_iconv   = Text::Iconv->new( 'utf8', $enc );
          my $convert_unicode = Unicode::Map8->new($enc);

          my $text = <DATA>;
          chomp $text;

          # lets just check the output!
          print "Encode        : ", encode( "iso-8859-1", $text ), "\n";
          print "Text::Iconv   : ", $convert_iconv->convert($text), "\n";
          print "Unicode::Map8 : ", $convert_unicode->to8( utf8($text)->ucs2 ), "\n";
          print "regexp        : ", latin1($text), "\n";

          # now benchmark
          cmpthese( 500000, {
              'Encode'        => sub { encode( "iso-8859-1", $text ); },
              'Text::Iconv'   => sub { $convert_iconv->convert($text); },
              'Unicode::Map8' => sub { $convert_unicode->to8( utf8($text)->ucs2 ); },
              'regexp'        => sub { latin1($text); },
          });

          sub latin1 {
              my $text = shift;
              $text =~ s{([\xc0-\xc3])(.)}{
                  my $hi = ord($1);
                  my $lo = ord($2);
                  chr( ( ( $hi & 0x03 ) << 6 ) | ( $lo & 0x3F ) )
              }ge;
              return $text;
          }

          __DATA__
          texte soupçonné d'être plein de caractères accentués

      Results:

          Encode        : texte souponn d'tre plein de caractres accentus
          Text::Iconv   : texte souponn d'tre plein de caractres accentus
          Unicode::Map8 : texte souponn d'tre plein de caractres accentus
          regexp        : texte souponn d'tre plein de caractres accentus
          Benchmark: timing 500000 iterations of Encode, Text::Iconv, Unicode::Map8, regexp...
                 Encode:  6 wallclock secs ( 4.91 usr +  0.02 sys =  4.93 CPU) @ 101419.88/s (n=500000)
            Text::Iconv:  2 wallclock secs ( 2.20 usr +  0.00 sys =  2.20 CPU) @ 227272.73/s (n=500000)
          Unicode::Map8:  7 wallclock secs ( 7.66 usr +  0.00 sys =  7.66 CPU) @ 65274.15/s (n=500000)
                 regexp:  6 wallclock secs ( 5.65 usr +  0.01 sys =  5.66 CPU) @ 88339.22/s (n=500000)

                            Rate  Unicode::Map8  regexp  Encode  Text::Iconv
          Unicode::Map8  65274/s             --    -26%    -36%         -71%
          regexp         88339/s            35%      --    -13%         -61%
          Encode        101420/s            55%     15%      --         -55%
          Text::Iconv   227273/s           248%    157%    124%           --

      Note: I am not an expert in using Benchmark, so please let me know if my test is flawed.
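One point that may be worth checking in a comparison like this: Encode's encode expects a character string, so if the input is still raw UTF-8 bytes it has to be decoded first to do the same work as the other converters. Whether that applies to a line read from DATA depends on the perl version and locale, so the following is only a sketch of the explicit round trip, using a short made-up sample string:

```perl
use strict;
use warnings;
use Encode qw(decode encode from_to);

my $utf8_bytes = "soup\xc3\xa7onn\xc3\xa9";    # UTF-8 bytes for "soupçonné"

# step by step: UTF-8 bytes -> characters -> latin1 bytes
my $chars  = decode( 'UTF-8', $utf8_bytes );
my $latin1 = encode( 'iso-8859-1', $chars );

# or in place, in a single call
my $copy = $utf8_bytes;
from_to( $copy, 'UTF-8', 'iso-8859-1' );

print length($utf8_bytes), " bytes in, ", length($latin1), " bytes out\n";
print "from_to gives the same result\n" if $copy eq $latin1;
```

If a benchmark timed only the encode step, it would be measuring less work than the converters that start from UTF-8 bytes, so timing decode plus encode (or from_to on a fresh copy) would be the closer comparison.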
