PerlMonks  

Re: Decoding UTF-8 - "Cannot decode string with wide characters"

by shmem (Canon)
on Aug 24, 2006 at 17:29 UTC (#569402)


in reply to Decoding UTF-8 - "Cannot decode string with wide characters"

Ah, those wide chars...

Google and Super Search only bring up dashed hopes.

<update2>
This problem arises from using the wrong conversion routine in PDF::API2 for the hex characters that represent a string in the PDF. In the case at hand, each printable character is followed by a NUL byte. Using

s/(..)/chr(hex($1))/ge; # convert 0x77 0x00 -> w^@
correctly translates them into a sequence of ASCII chars, each followed by a NUL byte, while
s/(....)/chr(hex($1))/ge # convert 0x7700 -> \347\234\200
leads to a UTF-8 string - chr() works for UTF-8 too - but 0x77 0x00 isn't the same as 0x7700. chr(0x7700) is a single wide character whose internal representation is "\347\234\200" (in octal notation) - three bytes. The only way to get its original value back is using ord - which also works for UTF-8 :-)
</update2>
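
To make the difference concrete, here is a minimal sketch. The sample hex string and the variable names are made up for illustration and are not taken from PDF::API2; it only shows what the two substitutions above do to the same input, and that ord gets the original byte pair back.

```perl
use strict;
use warnings;

# Hypothetical sample: hex dump of "w", NUL, "x", NUL.
my $hex = '77007800';

# Two hex digits per chr(): one byte per character.
(my $bytes = $hex) =~ s/(..)/chr(hex($1))/ge;

# Four hex digits per chr(): one wide character per byte pair.
(my $wide = $hex) =~ s/(....)/chr(hex($1))/ge;

printf "bytes: length %d, ords %s\n", length($bytes),
    join(' ', map { sprintf('0x%02x', ord($_)) } split //, $bytes);
printf "wide:  length %d, ords %s\n", length($wide),
    join(' ', map { sprintf('0x%04x', ord($_)) } split //, $wide);

# ord() recovers the code point; pack 'n' splits it back into
# the high byte followed by the low byte.
my $recovered = join '', map { pack('n', ord($_)) } split //, $wide;
print "round trip ok\n" if $recovered eq $bytes;
```

The first string has four characters (0x77, 0x00, 0x78, 0x00), the second only two (0x7700, 0x7800).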

Try this silly sub

# stolen from Data::Dumper and tweaked.. ;-)
sub narrow_char {
    return join('',
        map { chr(hex $_) }
        map { (my $s = sprintf("%x", ord($_))) =~ s/00$//; $s; }
        split //, $_[0]
    );
}

my %info = (
    'CreationDate', 'D:20060817180621+01\'00\'',
    'Producer', "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}",
    'Creator', "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
    'Author', "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}",
    'Title', "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}",
);

foreach my $key (sort keys %info) {
    print "$key -> ";
    print narrow_char($info{$key});
    print "\n";
}
__END__
# output:
Author -> Bob Webster
CreationDate -> D:20060817180621+01'00'
Creator -> Writer
Producer -> OpenOffice.org 2.0
Title -> COMPETITIVE SAFARI

whenever decode barfs...

update: Another way (less silly?) (regexp by mtve):

sub narrow_char {
    $_[0] =~ s/(.)/chr(ord($1)>>8)/eg
        if (length($_[0]) * 3 == do { use bytes; length $_[0] });
    $_[0];
}

<update2>
Caveat: This routine modifies $_[0] in place, so its value is changed in the caller as well. </update2>
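
If the in-place change is unwanted, one possible variant is to work on a copy. This is only a sketch: the name narrowed_copy is made up here, and the logic is otherwise the same shift-by-8 trick with the same three-bytes-per-character guard.

```perl
use strict;
use warnings;

# Hypothetical name; same technique as narrow_char, but on a copy.
sub narrowed_copy {
    my ($str) = @_;    # copying breaks the alias to the caller's variable

    # Byte length of the internal representation.
    my $byte_len = do { use bytes; length $str };

    # Only touch strings in which every character takes three bytes.
    $str =~ s/(.)/chr(ord($1) >> 8)/eg if length($str) * 3 == $byte_len;
    return $str;
}

my $creator = "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}";
my $plain   = narrowed_copy($creator);

print "$plain\n";
print "original untouched\n" if ord($creator) == 0x5700;
```

The caller's string keeps its wide characters, and only the returned value is narrowed.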

--shmem

update: populated solution with strings from OP

update2: added some explanation

_($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                              /\_/(q    /
----------------------------  \__(m.====.(_("always off the crowd"))."
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

