Re^2: UTF-8 and XML::Parser

by Anonymous Monk
on Oct 14, 2012 at 03:56 UTC

in reply to Re: UTF-8 and XML::Parser
in thread UTF-8 and XML::Parser

Oh, I overlooked the

binmode STDOUT, ":encoding(UTF-8)";
line. Well, that does make a difference. But I'm still puzzled why I get UTF-8 output when I use ProtocolEncoding => 'ISO-8859-1', and which of those two ways does less work encoding and decoding.
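
For what it's worth, the effect of that binmode line can be seen without XML::Parser at all. This is only a minimal sketch using core Perl: the helper name bytes_written is made up for illustration, and an in-memory filehandle stands in for STDOUT.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory handle opened with the
# given mode/layer, then return the resulting bytes as a hex string.
sub bytes_written {
    my ($mode, $text) = @_;
    my $buf = '';
    open(my $fh, $mode, \$buf) or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh);
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buf;
}

my $word = "M\x{fc}ller";    # character string; U+00FC is u-umlaut

print bytes_written('>', $word), "\n";                   # 4d fc 6c 6c 65 72
print bytes_written('>:encoding(UTF-8)', $word), "\n";   # 4d c3 bc 6c 6c 65 72
```

The same character string comes out as one byte or two for the umlaut, depending only on the layer on the output handle.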


Re^3: UTF-8 and XML::Parser
by remiah (Hermit) on Oct 14, 2012 at 04:26 UTC

    Maybe you saved your script with UTF-8 encoding. If you save the script as ISO-8859-1, you will get an ISO-8859-1 result.

    Below, the first run is the script saved as UTF-8 and the second (082-1) is the script saved as ISO-8859-1. "ü" is "c3 bc" in UTF-8 and "fc" in ISO-8859-1.

    >cat |perl -ne 'print $1 if m!<word>(.*?)</word>!' | hd
    00000000  4d c3 bc 6c 6c 65 72  |M..ller|
    00000007
    >cat |perl -ne 'print $1 if m!<word>(.*?)</word>!' | hd
    00000000  4d fc 6c 6c 65 72  |M.ller|
    00000006
    >
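
To check that the two dumps really are the same word in two encodings, the bytes can be decoded back with the core Encode module. A rough sketch, with the hex strings copied from the dumps; decode_hex is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn a hex string into bytes and decode those bytes
# with the named encoding, giving a Perl character string.
sub decode_hex {
    my ($encoding, $hex) = @_;
    my $bytes = pack 'H*', $hex;
    return decode($encoding, $bytes);
}

my $from_utf8   = decode_hex('UTF-8',      '4dc3bc6c6c6572');   # 7 bytes
my $from_latin1 = decode_hex('ISO-8859-1', '4dfc6c6c6572');     # 6 bytes

print length($from_utf8), " and ", length($from_latin1), " characters\n";
print $from_utf8 eq $from_latin1 ? "same text\n" : "different text\n";
```

Both decode to the same six characters, so the difference in the dumps is purely the encoding of the saved script.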

      I benchmarked the binmode variant against the 'use open' :utf8 variant shown below. I made an XML file with 100 lines and 32000 non-ASCII characters (UTF-8) in each line ((P)CDATA). The script below did it in 0.20 seconds, while the 'use utf8; / binmode' method took about 17.5 seconds.

      Unfortunately, perl crashes when I give a filehandle to the parser while using the 'use open qw/:std :utf8/;' method once the file gets big. The 'use utf8; / binmode' method takes about 35 seconds when I pass the filehandle to the parser.

      Output was redirected to /dev/null.

      #!/usr/bin/perl
      use XML::Parser;
      #use utf8;
      use open qw/:std :utf8/;

      $ch = sub {
          my ($p, $w) = @_;
          # binmode STDOUT, ":encoding(UTF-8)";
          print "$w\n";
      };

      $p = XML::Parser->new(ProtocolEncoding => 'UTF-8');
      $p->setHandlers('Char' => $ch);

      my $xml = "";
      open(F, '< x.xml');
      while(<F>) {
          $xml .= $_;
      }
      $p->parse($xml);
      #$p->parse(*F);
      close(F);
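
If someone wants to time the two output layers on their own, leaving the parser out of it, something like the following might do. It is only a sketch under assumptions: time_layer is a made-up helper, the in-memory buffer replaces STDOUT, and the line of 32000 u-umlauts is a guess at what one line of x.xml looks like. It says nothing about where the time goes inside XML::Parser.

```perl
use strict;
use warnings;
use Time::HiRes ();

# Hypothetical helper: write $text $count times through $layer into an
# in-memory buffer; return elapsed seconds and the number of bytes written.
sub time_layer {
    my ($layer, $text, $count) = @_;
    my $buf = '';
    open(my $fh, ">$layer", \$buf) or die "cannot open in-memory handle: $!";
    my $start = Time::HiRes::time();
    print {$fh} $text for 1 .. $count;
    close($fh);
    return (Time::HiRes::time() - $start, length($buf));
}

# Assumed stand-in for one line of the test file: 32000 non-ASCII characters.
my $line = ("\x{fc}" x 32000) . "\n";

for my $layer (':utf8', ':encoding(UTF-8)') {
    my ($secs, $bytes) = time_layer($layer, $line, 100);
    printf "%-18s %.4f s, %d bytes\n", $layer, $secs, $bytes;
}
```

Both layers should produce the same number of bytes; only the timings are of interest, and they will vary by machine and perl version.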

        Does parsefile() have trouble too?

