PerlMonks  

Re: UTF-8 and XML::Parser

by remiah (Hermit)
on Oct 14, 2012 at 02:35 UTC (#998905)


in reply to UTF-8 and XML::Parser

Hello.

You need "use utf8;" so that literal strings in a UTF-8 saved script are treated as UTF-8 characters instead of raw bytes.

    #!/usr/bin/perl
    use XML::Parser;
    use utf8; ###added

    $xml = "<word>Müller</word>";
    $ch = sub {
        my ($p, $w) = @_;
        binmode STDOUT, ":encoding(UTF-8)"; ###added
        print "#$w#\n";
    };

    # the next commented 2 lines mean the same
    # and translate the output to iso-8859-1
    # $p = XML::Parser->new();
    $p = XML::Parser->new(ProtocolEncoding => 'UTF-8');

    # this line do the right job, but why??
    #$p = XML::Parser->new(ProtocolEncoding => 'ISO-8859-1');

    $p->setHandlers('Char' => $ch);
    $p->parse($xml);

From perluniintro ...

...if your Perl script itself is encoded in UTF-8, you can use UTF-8 in your identifier names, and in string and regular expression literals, by saying use utf8. This is not the default because scripts with legacy 8-bit data in them would break.
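A small sketch of the byte versus character distinction behind that quote. It uses Encode's decode on escaped octets instead of a literal "ü", so the result does not depend on how this snippet itself is saved. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 octets of "Müller", written with escapes so this file's own
# encoding does not matter.
my $octets = "M\xc3\xbcller";

# Decoding turns the octets into a character string, which is roughly what
# "use utf8" arranges for literals in a UTF-8 encoded source file.
my $chars = decode('UTF-8', $octets);

printf "octets: length %d\n", length $octets;
printf "chars:  length %d\n", length $chars;
```

The first length counts bytes (7) and the second counts characters (6), because "ü" takes two bytes in UTF-8.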

regards.


Re^2: UTF-8 and XML::Parser
by Anonymous Monk on Oct 14, 2012 at 03:56 UTC
    Oh, I overlooked the
    binmode STDOUT, ":encoding(UTF-8)";
    line. Well, this makes a difference. But I'm still puzzled why I get UTF-8 when I use ProtocolEncoding => 'ISO-8859-1', and which of those two ways does less work in encoding/decoding.

    greetings

      Maybe you saved your script in UTF-8 encoding. If you save the script as ISO-8859-1, you will get an ISO-8859-1 result.
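One possible reading of why the ISO-8859-1 setting appears to "work" with a UTF-8 saved script, sketched with Encode only and not with XML::Parser itself. Both steps below are assumptions about what happens: the parser maps every input octet to one character, and a print without an encoding layer writes each of those characters back as one octet.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 octets of "Müller", as they sit in a UTF-8 saved script
# that does not say "use utf8".
my $octets = "M\xc3\xbcller";

# Assumption: a parser told the input is ISO-8859-1 maps each octet
# to exactly one character.
my $chars = decode('ISO-8859-1', $octets);

# Assumption: printing to a handle with no encoding layer writes characters
# below 256 as single octets, modelled here with an ISO-8859-1 encode.
my $printed = encode('ISO-8859-1', $chars);

printf "characters seen by the handler: %d\n", length $chars;
print  "octets unchanged: ", ($printed eq $octets ? "yes" : "no"), "\n";
```

If that reading is right, the two misinterpretations cancel out: the handler sees 7 characters instead of 6, but the bytes that reach the terminal are the original UTF-8 bytes.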

      Below, 082.pl is the script saved as UTF-8 and 082-1.pl is the script saved as ISO-8859-1. "ü" is "c3 bc" in UTF-8 and "fc" in ISO-8859-1.

      >cat 082.pl |perl -ne 'print $1 if m!<word>(.*?)</word>!' | hd
      00000000  4d c3 bc 6c 6c 65 72  |M..ller|
      00000007
      >cat 082-1.pl |perl -ne 'print $1 if m!<word>(.*?)</word>!' | hd
      00000000  4d fc 6c 6c 65 72  |M.ller|
      00000006
      >
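The same byte sequences can be reproduced without saving two script files. This is a minimal sketch using Encode; hex_octets is a helper name made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return the octets of a byte string as space-separated lowercase hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', unpack '(H2)*', $octets;
}

# "Müller" as a character string; U+00FC is "ü".
my $word = "M\x{fc}ller";

print "UTF-8:      ", hex_octets(encode('UTF-8', $word)), "\n";
print "ISO-8859-1: ", hex_octets(encode('ISO-8859-1', $word)), "\n";
```

This prints "4d c3 bc 6c 6c 65 72" for UTF-8 and "4d fc 6c 6c 65 72" for ISO-8859-1, matching the two hd dumps.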

        I benchmarked the 'use utf8; / binmode' variant against the 'use open qw/:std :utf8/;' variant shown below. I made an XML file with 100 lines and 32000 ü's (UTF-8) in each line ((P)CDATA). The script below did it in 0.20 seconds, while the 'use utf8; / binmode' method takes about 17.5 seconds.

        Unfortunately, Perl crashes when I give a filehandle to the parser while using the 'use open qw/:std :utf8/;' method once the file gets big. The 'use utf8; / binmode' method takes about 35 seconds when I pass the filehandle to the parser.

        output got redirected to /dev/null

        #!/usr/bin/perl
        use XML::Parser;
        #use utf8;
        use open qw/:std :utf8/;

        $ch = sub {
            my ($p, $w) = @_;
            # binmode STDOUT, ":encoding(UTF-8)";
            print "$w\n";
        };

        $p = XML::Parser->new(ProtocolEncoding => 'UTF-8');
        $p->setHandlers('Char' => $ch);

        my $xml = "";
        open(F, '< x.xml');
        while(<F>) {
            $xml .= $_;
        }
        $p->parse($xml);
        #$p->parse(*F);
        close(F);
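For anyone who wants to repeat the measurement, here is a rough sketch of a generator for a comparable input file. The post does not show the real layout of x.xml, so the element names doc and line and the XML declaration are assumptions; only the 100 lines with 32000 ü's each come from the description above.

```perl
use strict;
use warnings;

# Build an XML document with $lines lines, each holding $per_line copies of
# "ü" as character data. The element names <doc> and <line> are hypothetical.
sub build_xml {
    my ($lines, $per_line) = @_;
    my $u_umlaut = "\xc3\xbc";    # UTF-8 octets of U+00FC
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n<doc>\n};
    $xml .= '<line>' . ($u_umlaut x $per_line) . "</line>\n" for 1 .. $lines;
    $xml .= "</doc>\n";
    return $xml;
}

# Write the file only when a path is given, e.g.: perl make_xml.pl x.xml
if (defined(my $path = shift @ARGV)) {
    open my $fh, '>:raw', $path or die "Cannot write $path: $!";
    print {$fh} build_xml(100, 32_000);
    close $fh or die "Cannot close $path: $!";
}
```

The file is written in raw mode because the string already holds UTF-8 octets, so no encoding layer is needed on output.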
Re^2: UTF-8 and XML::Parser
by Anonymous Monk on Oct 14, 2012 at 02:51 UTC
    What does 'use utf8;' have to do with this problem? And sure, I tried this with no luck. Try it yourself. The quoted text just means I can use UTF-8 encoded variable names, like:
    $Müller = 23;
      Not really. Try:
      perl -wE 'say length "Müller"; use utf8; say length "Müller";'
      لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
