Writing HTML file with UTF-8 chars

by cormanaz (Chaplain)
on Apr 27, 2013 at 15:56 UTC (#1030979)
cormanaz has asked for the wisdom of the Perl Monks concerning the following question:

Good day bros. I have some URLs and page titles that contain UTF-8 chars (French and Arabic). I am trying to write them to an HTML file like so:
    use Encode;
    . . .
    open(OUT,">topurls.htm") or die "Can't open output: $!";
    print OUT <<'END_HEADER';
    <html>
    <head>
    <title>Top URLs</title>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    </head>
    <body>
    <h3>Top URLs</h3>
    <table cellpadding=10 border=1><tr><th>Link</th><th>Count</th><th>Users</th></tr>
    END_HEADER
    foreach my $u (keys %topurls) {
        my @line;
        $line[0] = '<a target="_blank" href="'.encode('UTF-8',$u).'">'.encode('UTF-8',$topurls{$u}{title}).'</a>';
        $line[1] = $topurls{$u}{count};
        $line[2] = $topurls{$u}{users};
        print OUT '<tr><td>'.join('</td><td>',@line).'</td></tr>'."\n";
    }
    print OUT '</table></body></html>';
    close OUT;
However, in the resulting file the characters come out in the Windows charset. On the advice of a post somewhere else, I tried adding binmode(OUT,":utf8"); after the file open, but that did not help. I also tried wrapping the encode statements in encode_entities from HTML::Entities, which did not help either. How do I get these chars to output properly?

Re: Writing HTML file with UTF-8 chars
by choroba (Abbot) on Apr 27, 2013 at 16:19 UTC
    Without knowing what is in %topurls, we cannot help you much. It seems the hash contains strings in an encoding other than UTF-8. How do you populate %topurls?
    لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
      It is populated from a query to a PgSQL DB. I'm sure the chars in the db are UTF-8; in fact, I've written them to an Excel sheet using encode() and it worked fine. Also, when I replace the loop in the example with
          foreach my $u (keys %topurls) {
              my @line;
              $line[0] = $u;
              $line[1] = $topurls{$u}{title};
              $line[2] = $topurls{$u}{count};
              $line[3] = $topurls{$u}{users};
              print join("\t",@line)."\n";
          }
      it prints correctly in the debugger (Komodo) output window.
        I'm sure the chars in the db are UTF-8...

        Have you set the pg_enable_utf8 flag in DBD::Pg? Having valid UTF-8 is important, but so is telling Perl which encoding to use to interpret the incoming data.
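
        To make the bytes-versus-characters point concrete, here is a minimal sketch using only the core Encode module. It assumes the driver hands over raw UTF-8 bytes, which is the situation the pg_enable_utf8 flag is meant to avoid; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The same Arabic word as raw UTF-8 bytes, the way data may arrive
# when nothing has told Perl how to interpret it (an assumption here).
my $bytes = pack('C*', 216, 167, 217, 132, 216, 185, 216, 177,
                       216, 168, 217, 138, 216, 169);

# Decoding turns the 14 bytes into 7 characters that Perl treats as text.
my $chars = decode('UTF-8', $bytes);

# Prints "bytes: 14, characters: 7"
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Encoding the decoded string gives back exactly the original bytes.
my $round_trip = encode('UTF-8', $chars);
print "round trip ok\n" if $round_trip eq $bytes;
```

        If length reports the byte count for a string you expected to be text, the data has probably not been decoded yet.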

        In the following example, all four combinations of two kinds of input and two output methods are shown. The Arabic string comes in as a sequence of bytes, without Perl knowing that they should be read as UTF-8. The French one, on the other hand, is a proper character string (thanks to use utf8; and saving the source as UTF-8). Writing the byte string to the output without trying to interpret the bytes gives the "correct" result, and so does writing the character string through the UTF-8 output layer. The other two combinations are wrong.
            #!/usr/bin/perl
            use warnings;
            use strict;
            use utf8;

            my %topurls = (arabic => { title => join(q(), map chr $_, 216, 167, 217, 132, 216, 185, 216, 177, 216, 168, 217, 138, 216, 169),
                                       count => 42,
                                       users => 11,
                                     },
                           french => { title => 'une chèvre goûte des légumes',
                                       count => 11,
                                       users => 42,
                                     }
                          );

            open my $OUT, '>', 'topurls.htm' or die "Can't open output: $!";
            print $OUT <<'END_HEADER';
            <html>
            <head>
            <title>Top URLs</title>
            <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
            </head>
            <body>
            <h3>Top URLs</h3>
            <table cellpadding=10 border=1><tr><th>Link</th><th>Count</th><th>Users</th></tr>
            END_HEADER

            for my $u (keys %topurls) {
                my @line;
                $line[0] = '<a target="_blank" href="'.$u.'">'.$topurls{$u}{title}.'</a>';
                $line[1] = $topurls{$u}{count};
                $line[2] = $topurls{$u}{users};
                binmode $OUT, ':bytes';
                print $OUT '<tr><td>Bytes: ', join('</td><td>', @line), "</td></tr>\n";
                binmode $OUT, ':utf8';
                print $OUT '<tr><td>UTF-8: ', join('</td><td>', @line), "</td></tr>\n";
            }
            print $OUT '</table></body></html>';
            close $OUT;

        Now you just have to find out what kind of input you have.

        لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
SOLVED Re: Writing HTML file with UTF-8 chars
by cormanaz (Chaplain) on Apr 27, 2013 at 18:33 UTC
    The solution was
    • Add use utf8;
    • add binmode OUT, ":utf8"; after the file open
    • Do not use encode() on the values
    Thanks for the comebacks everyone.
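
    The three points above can be put together roughly as follows. This is only a sketch, not the code from the original script: the sample entry in %topurls is invented, and it assumes the real data already arrives as decoded character strings.

```perl
use strict;
use warnings;
use utf8;    # string literals in this source file are UTF-8

# Hypothetical sample data standing in for the real %topurls hash.
my %topurls = (
    'http://example.com/chèvre' => {
        title => 'une chèvre goûte des légumes',
        count => 11,
        users => 42,
    },
);

my $file = 'topurls.htm';
open(my $out, '>', $file) or die "Can't open output: $!";
binmode($out, ':utf8');    # encode characters to UTF-8 on the way out

for my $u (sort keys %topurls) {
    # No encode() here: the output layer does the encoding exactly once.
    my @line = (
        '<a target="_blank" href="' . $u . '">' . $topurls{$u}{title} . '</a>',
        $topurls{$u}{count},
        $topurls{$u}{users},
    );
    print {$out} '<tr><td>' . join('</td><td>', @line) . "</td></tr>\n";
}
close($out) or die "Can't close output: $!";
```

    Calling encode() on the values as well would encode them twice, which is one way to end up with mangled characters in the file.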

      On the second point: add binmode OUT, ":utf8"; after the file open

      I think it is better to use binmode OUT, ":encoding(UTF-8)"; because ':encoding(UTF-8)' checks the data for actually being valid UTF-8, while ':utf8' just marks the data as UTF-8 without further checking.
      Please check binmode.
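
      As a minimal sketch of that suggestion, the snippet below writes a string through the ':encoding(UTF-8)' layer and reads the file back as raw bytes to see what landed on disk. The temporary file and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# "goûte", written with an escape so this source stays plain ASCII.
my $text = "go\x{fb}te";

my ($fh, $file) = tempfile(UNLINK => 1);
binmode($fh, ':encoding(UTF-8)');    # strict UTF-8 layer instead of ':utf8'
print {$fh} $text;
close($fh) or die "Can't close: $!";

# Read the file back as raw bytes to inspect what was actually written.
open(my $in, '<:raw', $file) or die "Can't open $file: $!";
my $written = do { local $/; <$in> };
close($in);

print "file holds the UTF-8 encoding of the text\n"
    if $written eq encode('UTF-8', $text);
```

      The layer can also be given directly in the open call, as in open(my $out, '>:encoding(UTF-8)', $file), which saves the separate binmode.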

      If you tell me, I'll forget.
      If you show me, I'll remember.
      if you involve me, I'll understand.
      --- Author unknown to me
