G'day from across the ditch, Ken. You're talkin' my language, mate.
Thanks very much for your time and all your tips.
The reason I wrote $f1 to the file and read it back into $f2 was just to make sure the variables weren't changing in the process, and from what I can tell they aren't. I'm struggling to understand how this issue is about writing/reading the file. My reasons are:
1. If I remove my "quick hack" and change the webpage's "Output: $f1" line to "Output: $f2" (which it was meant to be originally - sorry), the e-acute appears on the webpage correctly.
2. If I print $f1 (which has not been read from a file) to the PDF (e.g. $text->text("PDF Output:$f1=$f2");) the acutes don't appear correctly.
3. If I write $f1 to a file as you have suggested, and read it back into $f3, it then contains more bytes than $f1, and printing $f3 to the PDF still doesn't print e-acute properly.
Below is some modified code which demonstrates this (sorry, I haven't brought it into the general coding standards you've suggested at this stage).
#!/usr/bin/perl
use lib "/home/tospeirs/perl5/lib/perl5";
use CGI;
use PDF::API2;
use bytes;
use constant mm => 25.4 / 72;
$cgi = new CGI;
$f1 = $cgi->param('f1');
if (defined($f1))
{
open (FILE, ">utf8_test1.out") or die "Can't open outfile";
print FILE $f1;
close FILE;
open (FILE, "<utf8_test1.out") or die "Can't open infile";
$f2 = <FILE>;
close FILE;
open my $fh, '>:encoding(UTF-8)', 'utf8_test2.out';
print $fh $f1;
close $fh;
open $fh, '<:encoding(UTF-8)', 'utf8_test2.out';
$f3 = <$fh>;
close $fh;
$lengths = "Lengths: f1=" . bytes::length($f1) . ", f2=" . bytes::length($f2) . ", f3=" . bytes::length($f3);
$cmp = ($f1 eq $f2) ? 'f1=f2' : 'f1<>f2';
$cmp .= ($f1 eq $f3) ? ', f1=f3' : ', f1<>f3';
$pdf = PDF::API2->new();
$font1 = $pdf->corefont('Arial');
$page = $pdf->page; # Add blank page
$page->mediabox(210/mm, 297/mm);
$text = $page->text();
$text->font($font1, 28);
$text->translate(5/mm ,280/mm);
# A quick hack to handle a couple of special chars
#$f2 =~ s/\303\251/\351/g; # e-acute
#$f2 =~ s/\303\272/\372/g; # u-acute
$text->text("PDF Output:$f1=$f2=$f3");
$pdf->saveas('utf8_test1.pdf');
}
print <<EOF;
Content-Type: text/html; charset=utf-8\n
<!DOCTYPE html>
<html lang='en-NZ'>
<head>
<title>Test UTF-8</title>
<meta charset='UTF-8'>
</head>
<body>
<form method='post'>
Input: <input type='text' name='f1' value='$f1'>
<br>
<input type='submit' name='submit' value='Submit'>
<br>
Output f2: $f2
<br>
Output f3: $f3
<br>
$lengths
<br>
$cmp
</form>
</body>
</html>
EOF
This is what I see on the webpage after I submit "Cliché.":
Input: Cliché.
Submit
Output f2: Cliché.
Output f3: Cliché.
Lengths: f1=8, f2=8, f3=10
f1=f2, f1<>f3
And the PDF ends up containing this:
PDF Output:Cliché.=Cliché.=Cliché.
As you can see, none of those 3 came out right in the PDF, and $f3 is 2 bytes longer than $f1, as if it's been double-encoded or something. Check these octal dumps out:
$ od -c utf8_test1.out
0000000 C l i c h 303 251 .
$ od -c utf8_test2.out
0000000 C l i c h 303 203 302 251 .
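In case it helps, here's a minimal standalone sketch (not part of my CGI script, and the variable names are just for illustration) of what I suspect is going on with utf8_test2.out. It assumes $f1 arrives from the form as raw, undecoded UTF-8 bytes, which I haven't confirmed. If that's the case, encoding it to UTF-8 a second time would treat each byte as a Latin-1 character and turn 303 251 into 303 203 302 251, which matches the second dump.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode decode);

# Assumption: this is what the form field holds, i.e. the raw UTF-8
# bytes for "Cliché." with no decoding applied.
my $raw = "Clich\303\251.";

# Encoding undecoded bytes: each byte is treated as a Latin-1
# character and gets encoded again (double encoding).
my $double = encode('UTF-8', $raw);

# Decoding to characters first, then encoding, round-trips cleanly.
my $chars  = decode('UTF-8', $raw);
my $single = encode('UTF-8', $chars);

printf "raw:    %d bytes\n", length $raw;
printf "double: %d bytes\n", length $double;
printf "single: %d bytes\n", length $single;

# Octal dump of the double-encoded string, similar to od -c
print join(' ', map { sprintf '%03o', ord } split //, $double), "\n";
```

If my assumption is right, the double-encoded string should be 10 bytes against 8 for the other two, same as the f3 and f1 lengths on the webpage.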
Any ideas?
Thanks.
tel2