PerlMonks  

Re: Decoding UTF-8 - "Cannot decode string with wide characters"

by shmem (Canon)
on Aug 24, 2006 at 17:29 UTC (#569402)


in reply to Decoding UTF-8 - "Cannot decode string with wide characters"

Ah, those wide chars...

Google and Super Search only bring up dashed hopes.

<update2>
This problem arises from using the wrong conversion routine in PDF::API2 for the hex digits representing a string in the PDF. In the case at hand, each printable character is followed by a NULL byte. Using

s/(..)/chr(hex($1))/ge; # convert 0x77 0x00 -> w^@
correctly translates them into a sequence of ASCII chars, each followed by a NULL byte, while
s/(....)/chr(hex($1))/ge # convert 0x7700 -> \347\234\200
leads to a UTF-8 string (chr() works for UTF-8 too), but 0x77 0x00 isn't the same as 0x7700. The internal representation of chr(0x7700) is "\347\234\200" (in octal notation), which is three bytes. The only way to get its original value back is using ord, which also works for UTF-8 :-)
</update2>
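
The difference between the two substitutions can be checked on a small hex string. This is only a sketch: the sample digits below are made up for illustration (they spell "Wri" with a NULL byte after each char) and are not taken from PDF::API2 or from the OP's file.

```perl
use strict;
use warnings;

# Hypothetical sample: hex digits for "Wri", each char followed by a NULL byte.
my $hex = '570072006900';

# Two hex digits per byte: yields the bytes W \0 r \0 i \0.
(my $by_pairs = $hex) =~ s/(..)/chr(hex($1))/ge;

# Four hex digits per char: yields the wide chars U+5700 U+7200 U+6900.
(my $by_quads = $hex) =~ s/(....)/chr(hex($1))/ge;

printf "pairs: %d chars, first ord 0x%x\n", length($by_pairs), ord($by_pairs);
printf "quads: %d chars, first ord 0x%x\n", length($by_quads), ord($by_quads);

# ord() recovers the code point; shifting right by 8 bits gives the intended byte.
my $fixed = join '', map { chr(ord($_) >> 8) } split //, $by_quads;
print "fixed: $fixed\n";
```

The pair-wise version keeps six chars (with the NULL bytes), the quad-wise version collapses them into three wide chars, and the last line prints "fixed: Wri".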

Try this silly sub

# stolen from Data::Dumper and tweaked.. ;-)
sub narrow_char {
    return join('', map { chr(hex $_) }
        map { (my $s = sprintf("%x", ord($_))) =~ s/00$//; $s; }
        split //, $_[0]
    );
}

my %info = (
    'CreationDate',
    'D:20060817180621+01\'00\'',
    'Producer',
    "\x{4f00}\x{7000}\x{6500}\x{6e00}\x{4f00}\x{6600}\x{6600}\x{6900}\x{6300}\x{6500}\x{2e00}\x{6f00}\x{7200}\x{6700}\x{2000}\x{3200}\x{2e00}\x{3000}",
    'Creator',
    "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}",
    'Author',
    "\x{4200}\x{6f00}\x{6200}\x{2000}\x{5700}\x{6500}\x{6200}\x{7300}\x{7400}\x{6500}\x{7200}",
    'Title',
    "\x{4300}\x{4f00}\x{4d00}\x{5000}\x{4500}\x{5400}\x{4900}\x{5400}\x{4900}\x{5600}\x{4500}\x{2000}\x{5300}\x{4100}\x{4600}\x{4100}\x{5200}\x{4900}",
);

foreach my $key (sort keys %info) {
    print "$key -> ";
    print narrow_char($info{$key});
    print "\n";
}
__END__
# output:
Author -> Bob Webster
CreationDate -> D:20060817180621+01'00'
Creator -> Writer
Producer -> OpenOffice.org 2.0
Title -> COMPETITIVE SAFARI

whenever decode barfs...

update: Another way (less silly?) (regexp by mtve):

sub narrow_char {
    $_[0] =~ s/(.)/chr(ord($1)>>8)/eg
        if (length($_[0]) * 3 == do { use bytes; length $_[0] } );
    $_[0];
}

<update2>
Caveat: This routine modifies $_[0] in-place, so its value is changed in the caller as well. </update2>
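
If the in-place change is unwanted, one way around it is to work on a copy. The following is a sketch only; narrow_char_copy is a hypothetical name, not something from the original post, and it keeps the same length heuristic as the routine above.

```perl
use strict;
use warnings;

# Hypothetical non-mutating variant: operates on a copy of the argument,
# so the caller's string is left untouched.
sub narrow_char_copy {
    my ($s) = @_;    # copy instead of aliasing $_[0]
    my $bytes = do { use bytes; length $s };
    # Same heuristic as above: only convert when the internal byte length
    # is exactly three times the character count.
    $s =~ s/(.)/chr(ord($1) >> 8)/eg if length($s) * 3 == $bytes;
    return $s;
}

my $wide   = "\x{5700}\x{7200}\x{6900}\x{7400}\x{6500}\x{7200}";
my $narrow = narrow_char_copy($wide);

print "$narrow\n";                           # the converted copy
printf "caller still has 0x%x\n", ord($wide); # original is unchanged
```

Plain ASCII strings fail the length check and are returned as they are, just as with the in-place version.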

--shmem

update: populated solution with strings from OP

update2: added some explanation

_($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                              /\_/(q    /
----------------------------  \__(m.====.(_("always off the crowd"))."
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

