Re: Decoding UTF-8 - "Cannot decode string with wide characters"

by shmem (Chancellor)
on Aug 24, 2006 at 17:29 UTC (#569402)

in reply to Decoding UTF-8 - "Cannot decode string with wide characters"

Ah, those wide chars...

Google and Super Search only bring up dashed hopes.

This problem arises from using the wrong conversion routine in PDF::API2 for hex chars representing a string in the PDF. In the case at hand, each printable character is followed by a NULL byte. Using

s/(..)/chr(hex($1))/ge; # convert 0x77 0x00 -> w^@
correctly translates them into a sequence of ASCII chars, each followed by a NULL byte, while
s/(....)/chr(hex($1))/ge; # convert 0x7700 -> \347\234\200
leads to a UTF-8 string - chr() works for UTF-8 too - but 0x77 0x00 isn't the same as 0x7700. The internal representation of chr(0x7700) is "\347\234\200" (in octal notation) - three bytes. The only way to get its original value back is using ord - which also works for UTF-8 :-)
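To see the difference between the two substitutions side by side, here is a minimal sketch. The hex string '77006f00' is a made-up sample ("wo", each character followed by a NUL byte), not data from the PDF in question, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input: "w" and "o", each followed by a NUL byte.
my $hex = '77006f00';

# Two hex digits at a time: one character per byte.
(my $per_byte = $hex) =~ s/(..)/chr(hex($1))/ge;

# Four hex digits at a time: one wide character per byte pair.
(my $per_pair = $hex) =~ s/(....)/chr(hex($1))/ge;

# Encode a copy of the wide string to UTF-8 and show the bytes in octal.
my $utf8 = $per_pair;
utf8::encode($utf8);
my $octal = join '', map { sprintf '\\%03o', ord($_) } split //, $utf8;

printf "per byte: %d chars\n", length($per_byte);
printf "per pair: %d chars, first is 0x%04x\n", length($per_pair), ord($per_pair);
print  "UTF-8 bytes of the wide string: $octal\n";
```

The first form keeps four one-byte characters; the second yields two wide characters, and ord on the first of them gives back 0x7700.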

Try this silly sub

# stolen from Data::Dumper and tweaked.. ;-)
sub narrow_char {
    return join('', map { chr(hex $_) }
        map { (my $s = sprintf("%x", ord($_))) =~ s/00$//; $s; }
        split //, $_[0]
    );
}

my %info = (
    'CreationDate', 'D:20060817180621+01\'00\'',
    'Producer', "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}",
    'Creator', "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
    'Author', "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}",
    'Title', "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}",
);

foreach my $key (sort keys %info) {
    print "$key -> ";
    print narrow_char($info{$key});
    print "\n";
}

__END__
# output:
Author -> Bob Webster
CreationDate -> D:20060817180621+01'00'
Creator -> Writer
Producer -> OpenOffice.org 2.0
Title -> COMPETITIVE SAFARI

whenever decode barfs...

update: Another way (less silly?) (regexp by mtve):

sub narrow_char {
    $_[0] =~ s/(.)/chr(ord($1)>>8)/eg
        if (length($_[0]) * 3 == do { use bytes; length $_[0] });
    $_[0];
}

Caveat: This routine modifies $_[0] in place, so its value is changed in the caller as well.
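If the in-place change is unwanted, one possible workaround is to do the same thing on a copy. This is only a sketch: narrowed_copy is a hypothetical name, and it uses the same "three bytes per character" test and the same shift as the routine above.

```perl
use strict;
use warnings;

# Hypothetical copy-based variant: same check and shift as narrow_char,
# but it works on a copy, so the caller's string stays untouched.
sub narrowed_copy {
    my ($s) = @_;
    my $bytes = do { use bytes; length $s };   # length of the internal byte representation
    $s =~ s/(.)/chr(ord($1) >> 8)/eg if length($s) * 3 == $bytes;
    return $s;
}

my $creator = "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}";
my $plain   = narrowed_copy($creator);

print "$plain\n";                      # the narrowed copy
printf "0x%04x\n", ord($creator);      # first character of the untouched original
```

Strings that are not made up entirely of three-byte characters are returned unchanged, just as with the original routine.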


update: populated solution with strings from OP

update2: added some explanation

