PerlMonks  

What Voodoo Encoding does RTF use for > ASCII Chars?

by tosh (Scribe)
on Mar 20, 2012 at 21:06 UTC (#960656=perlquestion)
tosh has asked for the wisdom of the Perl Monks concerning the following question:

I'm using templates that will eventually create RTF docs. All is well until (stop me if you've heard this one before...) non-ASCII characters. Rerun!!

But wait!! This is actually a little bit different. According to the RTF specification on encoding there are two kinds of escape: \'HEX and \uVOODOO. Why there are two is beyond me, and I haven't figured out when to use one or the other.

So while I struggle on with the above, I have this problem here:

If I create an RTF document in OS X with the TextEdit program and put some nice accents in it, like say:

à, è, ì, ò, ù

Then they are encoded in the document as: \'e0, \'e8, \'ec, \'f2, \'f9

Can anyone help me figure out by what witchcraft this was done, because straight up
$x =~ s/([\x00-\x1F\ x7F-\xFF])/"\\'" .(unpack("H2",$1))/eg;
Does not work.

And don't get me started on using Unicode::Escape, it's too slow and also doesn't match the RTF specs.

Urgh!!

Tosh

Re: What Voodoo Encoding does RTF use for > ASCII Chars?
by Corion (Pope) on Mar 20, 2012 at 21:12 UTC

    You will need to find out the encoding of your input data. The \u suggests to me that the characters are likely UTF-8-hex-encoded unicode code points. You will need to find out what encoding / codepage RTF actually uses, and encode to that target. See perluniintro, and whatever RTF spec.

    Also, it would be interesting to hear from you how Unicode::Escape fails for you and where it misses the RTF specs (and where the RTF specs are to be found). I don't find the Unicode::Escape documentation talking about RTF at all, so maybe there is some finer point I'm missing.

      \u is used for Unicode documents according to Wikipedia:
      http://en.wikipedia.org/wiki/Rich_Text_Format#Character_encoding

      \' is used for Windows1256 encoded, and no mention is made of Mac-encoded even tho' Word for Mac uses it.

      Unicode::Escape fails because it seems to encode higher order characters differently than RTF editors do.

      I don't care who is right, it's just not the same and so doesn't work, and of course this presents problems when I'm filling my templates with UTF8 data and trying to filter it. :(
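For reference, a minimal sketch of how a \u escape is built, going by the RTF spec's rule that \uN takes a signed 16-bit decimal number followed by a fallback character for readers that don't understand \u. The helper name rtf_unicode_escape and the choice of '?' as the fallback are assumptions made for illustration, not something from this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: turn one Unicode code point into RTF \uN escapes.
sub rtf_unicode_escape {
    my ($cp) = @_;

    # Code points above U+FFFF are written as a UTF-16 surrogate pair.
    my @units = $cp > 0xFFFF
        ? (0xD800 + (($cp - 0x10000) >> 10), 0xDC00 + (($cp - 0x10000) & 0x3FF))
        : ($cp);

    my $out = '';
    for my $unit (@units) {
        # \uN is a signed 16-bit value, so anything above 32767 goes negative.
        my $n = $unit > 0x7FFF ? $unit - 0x10000 : $unit;
        $out .= sprintf('\\u%d?', $n);    # '?' is the fallback character
    }
    return $out;
}

print rtf_unicode_escape(0x20AC), "\n";    # \u8364?
print rtf_unicode_escape(0xFB01), "\n";    # \u-1279?
```

The negative numbers are the part that usually makes output from a generic Unicode escaper differ from what RTF editors write.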
Re: What Voodoo Encoding does RTF use for > ASCII Chars?
by Eliya (Vicar) on Mar 20, 2012 at 21:20 UTC
    $x =~ s/([\x00-\x1F\ x7F-\xFF])/"\\'" .(unpack("H2",$1))/eg;
                        ^

    Assuming you're talking about Latin-1 characters (as the range up to just \xFF suggests), it should suffice to get rid of the indicated space...

    my $x = "foo à, è, ì, ò, ù bar";
    $x =~ s/([\x00-\x1F\x7F-\xFF])/"\\'" .(unpack("H2",$1))/eg;
    print $x;   # foo \'e0, \'e8, \'ec, \'f2, \'f9 bar

      Does it suffice for you? Because for me I get the following:
      my $x = "foo à, è, ì, ò, ù bar";
      $x =~ s/([\x00-\x1F\x7F-\xFF])/"\\'" .(unpack("H2",$1))/eg;
      print $x;
      # WHAT I WANT: foo \'e0, \'e8, \'ec, \'f2, \'f9 bar
      # What I get:  foo \'c3\'a0, \'c3\'a8, \'c3\'ac, \'c3\'b2, \'c3\'b9 bar
      Vexing...

        Judging by your output, your problem is that your string (source code) is UTF-8 encoded, but you have not told Perl about it.

        use utf8 if your source code is in UTF-8.  In case the data comes from elsewhere, decode it properly before using it.

        The UTF-8 encoding of the character 'à' (for example) is the two bytes c3 a0, so if you don't tell Perl that those two bytes are supposed to be decoded into one character, you'll have them incorrectly interpreted as the two latin-1 characters \xc3 and \xa0.
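A minimal sketch of the difference, assuming the data arrives as raw UTF-8 bytes (say, read from a file without an encoding layer). The helper name rtf_escape_latin1 is made up for illustration; the substitution is the same idea as the one in the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: escape every character outside printable ASCII
# as an RTF \'hh sequence. Expects characters no higher than \xFF.
sub rtf_escape_latin1 {
    my ($text) = @_;
    $text =~ s/([\x00-\x1F\x7F-\xFF])/sprintf("\\'%02x", ord $1)/eg;
    return $text;
}

# Raw UTF-8 bytes for "à, è": each accented letter is two bytes.
my $bytes = "\xC3\xA0, \xC3\xA8";

# Without decoding, every byte gets escaped on its own.
my $wrong = rtf_escape_latin1($bytes);

# After decoding, each accented letter is a single character.
my $right = rtf_escape_latin1(decode('UTF-8', $bytes));

print "$wrong\n";    # \'c3\'a0, \'c3\'a8
print "$right\n";    # \'e0, \'e8
```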

Re: What Voodoo Encoding does RTF use for > ASCII Chars?
by GrandFather (Cardinal) on Mar 20, 2012 at 23:02 UTC

    RTF is fundamentally an ANSI text file. A character set specification is used to set the code page used and may be one of:

    • \ansi - the default
    • \mac - Apple Macintosh
    • \pc - IBM(R) PC Code Page 437
    • \pca - IBM PC Code Page 850

    However, the encoded characters you show suggest that the code page being used is compatible with cp1252 (the Wikipedia article's mention of cp1256 is misleading and wrong as far as I can tell). The following may be the voodoo you are looking for:

    use strict;
    use warnings;
    use utf8;    # needed if this source file is saved as UTF-8
    use Encode;

    my $x = "foo à, è, ì, ò, ù bar";
    my $x1 = encode('cp1252', $x);
    $x1 =~ s<([\x00-\x1F\x7F-\xFF])> <"\\'" .(unpack("H2",$1))>eg;
    print $x1;
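Not every character fits in cp1252, so here is a hedged sketch that combines the two escape styles: \'hh where cp1252 has the character, \uN? otherwise. The helper name rtf_escape, the '?' fallback and the restriction to code points up to U+FFFF are assumptions for illustration only. It relies on encode() dying under Encode::FB_CROAK when a character cannot be mapped.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: \'hh for characters cp1252 can represent,
# \uN? for everything else (code points up to U+FFFF only).
sub rtf_escape {
    my ($text) = @_;
    my $out = '';
    for my $char (split //, $text) {
        my $cp = ord $char;
        if ($cp >= 0x20 && $cp < 0x7F) {
            # Printable ASCII passes through; \ { } are RTF syntax and need a backslash.
            $out .= $char =~ /[\\{}]/ ? "\\$char" : $char;
            next;
        }
        # encode() croaks on unmappable characters; LEAVE_SRC keeps $char intact.
        my $byte = eval {
            encode('cp1252', $char, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        };
        if (defined $byte && length($byte) == 1) {
            $out .= sprintf("\\'%02x", ord $byte);
        }
        else {
            # \uN is a signed 16-bit decimal, followed by a fallback character.
            $out .= sprintf('\\u%d?', $cp > 0x7FFF ? $cp - 0x10000 : $cp);
        }
    }
    return $out;
}

# é and the euro sign exist in cp1252; U+0142 (l with stroke) does not.
print rtf_escape("caf\x{E9} \x{20AC} \x{142}"), "\n";    # caf\'e9 \'80 \u322?
```

The input must already be a decoded character string, as discussed in the replies above.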
    True laziness is hard work
Re: What Voodoo Encoding does RTF use for > ASCII Chars?
by CountZero (Bishop) on Mar 20, 2012 at 23:19 UTC
    Did you have a look at RTF::Writer? It has an "escape" function.

    use Modern::Perl;
    use utf8;
    use RTF::Writer qw/rtfesc/;

    my $escaped = rtfesc('aà eé iî oö uü');
    say $escaped;

    Output:
    a\'e0 e\'e9 i\'ee o\'f6 u\'fc

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: What Voodoo Encoding does RTF use for > ASCII Chars?
by planetscape (Canon) on Mar 24, 2012 at 17:14 UTC
