Re: Write special chars to PDF. UTF8?

by kcott (Archbishop)
on Feb 12, 2016 at 13:07 UTC ( [id://1155072] )


in reply to Write special chars to PDF. UTF8?

G'day tel2,

"I'm guessing I might need to "use utf8", ..."

Sorry, but that would be a bad guess. The documentation for the utf8 pragma states, in emboldened text:

"Do not use this pragma for anything else than telling Perl that your script is written in UTF-8."

Your basic problem here is that the filehandle, FILE, doesn't know about the UTF-8. Example of what's happening:

    $ perl -Mutf8 -wE 'say "e-acute: é; u-acute: ú"'
    e-acute: ?; u-acute: ?

Here are three ways to address this problem:

  • Use the binmode function, e.g.
    $ perl -Mutf8 -wE 'binmode STDOUT => ":utf8"; say "e-acute: é; u-acute: ú"'
    e-acute: é; u-acute: ú
  • Use the open pragma, e.g.
    $ perl -Mutf8 -wE 'use open OUT => qw{:utf8 :std}; say "e-acute: é; u-acute: ú"'
    e-acute: é; u-acute: ú
  • Use the 3-argument form of the open function and specify the encoding in the mode. Something like this:
    open my $fh, '>:encoding(UTF-8)', $filename
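
As a rough sketch of the third option, the following writes a string through an :encoding(UTF-8) layer, then reads the file back both raw and decoded. The scratch file and the names $string, $raw_bytes and $read_back are illustrative assumptions, not taken from the original code.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Illustrative data: "Cliché." as a 7-character string (\x{e9} is e-acute).
my $string = "Clich\x{e9}.";

# Hypothetical scratch file; tempfile() is only used to obtain a safe name.
my ($tmp_fh, $filename) = tempfile(UNLINK => 1);
close $tmp_fh or die "Can't close '$filename': $!";

# Write through an encoding layer: characters are encoded to UTF-8 bytes.
open my $out_fh, '>:encoding(UTF-8)', $filename
    or die "Can't open '$filename' for writing: $!";
print {$out_fh} $string;
close $out_fh or die "Can't close '$filename': $!";

# Read the raw bytes back to see what actually landed on disk.
open my $raw_fh, '<:raw', $filename
    or die "Can't open '$filename' for reading: $!";
my $raw_bytes = do { local $/; <$raw_fh> };
close $raw_fh;

# Read again through a decoding layer to recover the original characters.
open my $in_fh, '<:encoding(UTF-8)', $filename
    or die "Can't open '$filename' for reading: $!";
my $read_back = do { local $/; <$in_fh> };
close $in_fh;

printf "characters: %d, bytes on disk: %d\n",
    length($string), length($raw_bytes);
print "round trip ok\n" if $read_back eq $string;
```

The point of the sketch is that the layer only does the right thing when the string handed to print holds characters rather than already-encoded bytes.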

Here are some recommendations for your code. These are unrelated to the UTF-8 issue.

  • Let Perl tell you about problems. Start using the strict pragma and the warnings pragma.
  • Your code is littered with package variables: $cgi, an object reference; $f1, a string; FILE, a filehandle; and so on. These are all global and suffer from the same problems as all global variables. Start using lexical variables, and control their scope, for far less error-prone code. There's a lot of information about this in perlsub; the "Private Variables via my()" section would be a good place to start.
  • Don't use indirect object syntax, e.g. code like new CGI. Here's what perlobj: Invoking Class Methods says, in emboldened text, at the start of the Indirect Object Syntax section:
    "Outside of the file handle case, use of this syntax is discouraged as it can confuse the Perl interpreter. See below for more details."
  • Start using lexical filehandles with the 3-argument form of the open function. See the open documentation for more about this.
  • Hand-crafting I/O die messages is time-consuming and error-prone. Let Perl do this task for you with the autodie pragma. You can then write code like this:
    use autodie;
    ...
    open my $in_fh, '<', $infile;
    open my $out_fh, '>', $outfile;
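
Pulling those recommendations together, here is a minimal sketch. It assumes a throwaway directory from File::Temp; the file names and variable names are hypothetical, and File::Temp merely stands in for any class to show a direct method call.

```perl
use strict;
use warnings;
use autodie;            # open and close now throw exceptions on failure
use File::Temp ();

# Direct class-method call (Class->method) rather than indirect object syntax.
my $dir     = File::Temp->newdir;      # hypothetical scratch directory
my $outfile = "$dir/demo.txt";

{
    # Lexical filehandle with 3-argument open; $out_fh is scoped to this block.
    open my $out_fh, '>', $outfile;
    print {$out_fh} "hello\n";
    close $out_fh;
}

open my $in_fh, '<', $outfile;
my $line = <$in_fh>;
close $in_fh;

# A failing open needs no hand-written "or die": autodie raises the error.
my $error = '';
eval {
    open my $missing_fh, '<', "$dir/no-such-file.txt";
    1;
} or $error = "$@";

print "read back: $line";
print "autodie reported: $error\n";
```

Note that autodie does not cover print, so a failed write would still need its own check.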

— Ken

Replies are listed 'Best First'.
Re^2: Write special chars to PDF. UTF8?
by tel2 (Pilgrim) on Feb 12, 2016 at 23:24 UTC
    G'day from across the ditch, Ken.  You're talkin' my language, mate.

    Thanks very much for your time and all your tips.

    The reason I wrote $f1 to the file and read it back into $f2 was just to make sure the variables weren't changing in the process, and from what I can tell they aren't. I'm struggling to understand how this issue is about writing/reading the file. My reasons are:

    1. If I remove my "quick hack" and change the webpage's "Output: $f1" line to "Output: $f2" (which it was meant to be originally - sorry), the e-acute appears on the webpage correctly.

    2. If I print $f1 (which has not been read from a file) to the PDF (e.g. $text->text("PDF Output:$f1=$f2");) no acutes appear correctly.

    3. If I write $f1 to a file as you have suggested, and read it back into $f3, it then contains more bytes than $f1, and printing $f3 to the PDF still doesn't print e-acute properly.

    Below is some modified code which demonstrates this (sorry, I haven't brought it into the general coding standards you've suggested at this stage).

    #!/usr/bin/perl
    use lib "/home/tospeirs/perl5/lib/perl5";
    use CGI;
    use PDF::API2;
    use bytes;
    use constant mm => 25.4 / 72;

    $cgi = new CGI;
    $f1 = $cgi->param(f1);

    if (defined($f1)) {
        open (FILE, ">utf8_test1.out") or die "Can't open outfile";
        print FILE $f1;
        close FILE;
        open (FILE, "<utf8_test1.out") or die "Can't open infile";
        $f2 = <FILE>;
        close FILE;

        open my $fh, '>:encoding(UTF-8)', 'utf8_test2.out';
        print $fh $f1;
        close $fh;
        open my $fh, '<:encoding(UTF-8)', 'utf8_test2.out';
        $f3 = <$fh>;
        close $fh;

        $lengths = "Lengths: f1=" . bytes::length($f1) . ", f2=" . bytes::length($f2) . ", f3=" . bytes::length($f3);
        $cmp = ($f1 eq $f2) ? 'f1=f2' : 'f1<>f2';
        $cmp .= ($f1 eq $f3) ? ', f1=f3' : ', f1<>f3';

        $pdf = PDF::API2->new();
        $font1 = $pdf->corefont('Arial');
        $page = $pdf->page; # Add blank page
        $page->mediabox(210/mm, 297/mm);
        $text = $page->text();
        $text->font($font1, 28);
        $text->translate(5/mm ,280/mm);

        # A quick hack to handle a couple of special chars
        #$f2 =~ s/\303\251/\351/g; # e-acute
        #$f2 =~ s/\303\272/\372/g; # u-acute

        $text->text("PDF Output:$f1=$f2=$f3");
        $pdf->saveas('utf8_test1.pdf');
    }

    print <<EOF;
    Content-Type: text/html; charset=utf-8\n
    <!DOCTYPE html>
    <html lang='en-NZ'>
    <head>
    <title>Test UTF-8</title>
    <meta charset='UTF-8'>
    </head>
    <body>
    <form method='post'>
    Input: <input type='text' name='f1' value='$f1'>
    <br>
    <input type='submit' name='submit' value='Submit'>
    <br>
    Output f2: $f2
    <br>
    Output f3: $f3
    <br>
    $lengths
    <br>
    $cmp
    </form>
    </body>
    </html>
    EOF

    This is what I see on the webpage after I submit "Cliché.":
    Input: Cliché.
    Submit
    Output f2: Cliché.
    Output f3: Cliché.
    Lengths: f1=8, f2=8, f3=10
    f1=f2, f1<>f3
    
    And the PDF ends up containing this:
    PDF Output:Cliché.=Cliché.=Cliché.
    
    As you can see, none of those 3 came out right in the PDF, and the $f3 looks extra long, as if it's been double-encoded or something.  Check this octal dump out:
    $ od -c utf8_test1.out
    0000000   C   l   i   c   h 303 251   .
    $ od -c utf8_test2.out
    0000000   C   l   i   c   h 303 203 302 251   .
    
    Any ideas?

    Thanks.
    tel2

      Try using decode() for the pdf

      #!/perl
      use strict;
      use warnings;
      use CGI;
      use CGI::Carp 'fatalsToBrowser';
      use PDF::API2;
      use Encode;

      my $cgi = new CGI;
      my $f1 = $cgi->param('f1');
      my $f2 = decode('UTF-8', $f1 );

      open OUT,'>','c:/temp/web/pdf.txt' or die; # change path to suit
      print OUT "$f1 $f2";
      close OUT;

      my $pdf = PDF::API2->new()->mediabox('A4');
      my $text = $pdf->page->text;
      my $font1 = $pdf->corefont('Arial');
      $text->font($font1, 36);
      $text->translate(100,500);
      $text->text("f1 = $f1");
      $text->translate(100,600);
      $text->text("f2 = $f2");
      $pdf->saveas('c:/temp/web/utf8_test1.pdf'); # change path to suit

      print <<EOF;
      Content-Type: text/html; charset=UTF-8\n
      <!DOCTYPE html>
      <html lang='en-NZ'>
      <head>
      <title>Test UTF-8</title>
      <meta charset="UTF-8">
      </head><body>
      $f1 $f2
      <form method="post">
      Input: <input type="text" name="f1" value="$f1"><br>
      <input type="submit" name="submit" value="Submit">
      </form></body></html>
      EOF
      poj
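
A small sketch of why decode() seems to help here. It assumes the form value arrives as undecoded UTF-8 bytes, which is what the od output of utf8_test1.out suggests. The octal_dump helper and the variable names are hypothetical, for illustration only.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: list each element of a string in octal, similar to od.
sub octal_dump {
    my ($str) = @_;
    return join ' ', map { sprintf('%03o', ord($_)) } split //, $str;
}

# Assumed input: "Cliché." as 8 raw UTF-8 bytes (e-acute is \303\251).
my $raw = "Clich\303\251.";

# Encoding bytes that are already UTF-8 treats each byte as a Latin-1
# character and encodes it again, which is the double encoding seen
# in utf8_test2.out.
my $double = encode('UTF-8', $raw);

# Decoding first turns the bytes into 7 characters; encoding those
# characters gives back the original 8 bytes.
my $chars  = decode('UTF-8', $raw);
my $single = encode('UTF-8', $chars);

print "raw bytes:      ", octal_dump($raw),    "\n";
print "double-encoded: ", octal_dump($double), "\n";
print "decoded chars:  ", octal_dump($chars),  "\n";
print "re-encoded:     ", octal_dump($single), "\n";
```

The double-encoded line contains 303 203 302 251, the same sequence as in the od dump of utf8_test2.out, while the decoded string holds a single 351 for the e-acute.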
        Thank you very much for that code, poj!

        That's working for me.

        tel2
