fanticla has asked for the
wisdom of the Perl Monks concerning the following question:
Dear Monks,
a rather silly problem: I'm using the WWW::Dict::Leo::Org module (it fetches and parses the content of an online dictionary, HTML in utf8), but I am experiencing the following encoding problem: I cannot properly display characters such as äüö.
The script is quite simple:
use strict;
use warnings;
use WWW::Dict::Leo::Org;
use Data::Dumper;
my $leo = new WWW::Dict::Leo::Org();
my @matches = $leo->translate("test");
open (OUT, "output.txt");
binmode(OUT, ":utf8");
print OUT Dumper(\@matches);
close OUT;
If I open output.txt, for example with Notepad++, I see that the encoding is right (utf8), but it fails to properly show characters such as äüö.
If I do not explicitly set the utf8 layer (the HTML site is utf8 encoded) and I open output.txt, I get an ANSI-encoded file, and äüö are not displayed correctly. If I change the encoding in Notepad++ from ANSI to utf8, all characters are displayed right!
Does anyone have a suggestion as to what I am doing wrong? Thanks, Cla
Re: WWW::Dict::Leo::Org encoding issue by Corion (Pope) on Jun 13, 2010 at 17:43 UTC
Are you sure that your tool, Notepad++, recognizes a UTF-8 encoded file as such when opening it?
Re: WWW::Dict::Leo::Org encoding issue by ikegami (Pope) on Jun 13, 2010 at 18:49 UTC
If I change the encoding in Notepad++ from ANSI to utf8, all characters are displayed right!
I'm confused. So it's working fine? Or do you want to output "ANSI" (which is probably really cp1252)?
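If cp1252 output really is what is wanted, a minimal sketch could look like the following. The file name output-cp1252.txt and the sample string are made up for illustration; the only thing relied on is the standard :encoding(...) PerlIO layer.

```perl
use strict;
use warnings;

# Hypothetical sample text; in the real script it would come from the module.
my $sample = "\x{E4}\x{F6}\x{FC} Stra\x{DF}e";   # "äöü Straße" as Perl characters

# Hypothetical file name, chosen for this example only.
my $cp_file = 'output-cp1252.txt';

# Encode to Windows-1252 on output instead of UTF-8.
open(my $cp_fh, '>:encoding(cp1252)', $cp_file) or die "$cp_file: $!\n";
print $cp_fh $sample, "\n";
close($cp_fh) or die "$cp_file: $!\n";
```

Each of these characters is a single byte in cp1252, so an editor that assumes "ANSI" should show them without any manual switching.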
open (IN, "<:utf8", "output.txt");
my $in = <IN>;
while ($in) {
    #doing some formatting
    $in =~ s/\'//g;
    $in =~ s/\=//g;
    $in =~ s/\>//g;
    if ($in =~ s/(.*)(left)(.*)/$1$2$3/g) {
        $text->insert('end', "$3");
    }
    $in = <IN>;
}
close IN;
Of course the text widget doesn't show the äöü correctly.
I am not an expert on encodings, but I could cope with all other encoding issues so far...
Perhaps it only switches to UTF-8 mode automatically when the document starts with a BOM. Try adding "\x{FEFF}" to the start of your document.
Re: WWW::Dict::Leo::Org encoding issue by graff (Canon) on Jun 14, 2010 at 03:09 UTC
open (OUT, "output.txt");
binmode(OUT, ":utf8");
print OUT Dumper(\@matches);
close OUT;
It looks like you are opening OUT for read access, then trying to write to it. And you're not checking for any errors, so when something goes wrong, you don't hear about it.
So, your script is not changing the contents of the file. Try opening for write access; the nicest way would be:
open( OUT, ">:utf8", "output.txt" ) or die "output.txt: $!\n";
BTW, I think Data::Dumper will make sure to convert unicode characters to their "\x{h*}" form, rather than printing actual utf8-encoded byte strings.
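That is easy to check with a small experiment. The sketch below is not from the original script: it dumps a made-up string that contains one character above U+00FF, so Perl has to treat it as a character string, and then looks at what Dumper returns.

```perl
use strict;
use warnings;
use Data::Dumper;

# A made-up string holding the euro sign (U+20AC), which cannot be
# stored as a single byte, so this is definitely a character string.
my $wide = "price: \x{20AC}";

local $Data::Dumper::Terse = 1;   # drop the "$VAR1 = " prefix
my $dumped = Dumper($wide);

# $dumped should be plain ASCII, with the euro sign written as an
# \x{...} escape rather than as UTF-8 encoded bytes.
print $dumped;
```

If the dump shows escapes like this, printing the individual hash values directly (instead of going through Dumper) is the way to get the actual characters into the file.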
ikegami's point about printing a BOM character first is simply that many tools (including Notepad, Wordpad and other M$ utils) rely on a file-initial BOM as a sort of "magic word" that tells the tool how it should interpret the file contents. So, after the kind of open statement shown above, I would do:
print OUT "\x{feff}\n";
# and then print all the utf8 text content...
Re: WWW::Dict::Leo::Org encoding issue by wwe (Friar) on Jun 14, 2010 at 10:50 UTC
I played a bit with your code. After some changes I get the right encoding and the right special characters like äöü. The only problem is some garbage where the text is formatted on the website, e.g. to scarify a road surface [constr.]
eine Straße aufreißen ��[Straßenbau]
I'm using Notepad2 for checking the file. See the code here:
use strict;
use warnings;
use Data::Dumper;
use v5.10;
use Encode qw(encode decode encode_utf8);
use WWW::Dict::Leo::Org;
my $leo = WWW::Dict::Leo::Org->new( -Debug => 0 );
open( my $fh, ">:utf8", "leo-translate.txt" );
my $string = 'strasse';
#$string = encode_utf8 ($string);
#$string = decode('utf8', $string);
#$string = encode('utf8', $string);
foreach my $match ( $leo->translate($string) ) {
    say {$fh} $match->{'title'};
    foreach my $value ( @{ $match->{'data'} } ) {
        my $string1 = decode('utf8', $value->{'left'} );
        my $string2 = decode('utf8', $value->{'right'} );
        my $string = join("\t", $value->{'left'}, $value->{'right'} );
        printf {$fh} ("%-50s%-50s\n", $string1, $string2);
    }
}
The file contains:
Substantive (8 of 8) Substantive (8 of 8)
avenue die Straße
forest road die Straße
highway die Straße
road die Straße
route die Straße
strait [geog.] die Straße ��- Meer
street die Straße
way die Straße
(keine) Substantiv: Straß - Flexionstabelle: Straß *)Substantiv: Strass - Flexionstabelle: Strass *) *) ein Service von canoo.net
Substantive (68 of 68) Substantive (68 of 68)
sunken road Straße in Tieflage
road tunnel [constr.] Straße in Tunnellage ��[Straßenbau]
undivided road Straße mit einer Fährbahn
single carriageway road Straße mit einer Fahrbahn ��[Straßenbau]
undivided two-way road Straße mit einer Fahrbahn
two-way road Straße mit Gegenverkehr ��[Straßenbau]
divided highway Straße mit getrennter Fahrbahn ��[Straßenbau]
divided road Straße mit getrennter Fahrbahn ��[Straßenbau]
clearway Straße mit Halteverbot (auch: Haltverbot)
cobbled street Straße mit Kopfsteinpflaster
dual carriageway Straße mit Mittelstreifen
tar concrete road Straße mit Teerbeton
odd-lane highway Straße mit ungerader Anzahl von Fahrbahnen
odd-lane road Straße mit ungerader Anzahl von Fahrbahnen
Strait of Dover [geog.] Straße von Dover
Strait of Gibraltar [geog.] Straße von Gibraltar
Korea Strait Straße von Korea
B-road Straße zweiter Ordnung ��[Straßenbau]
minor road Straße zweiter Ordnung ��[Straßenbau]
non-principal road Straße zweiter Ordnung ��[Straßenbau]
secondary road Straße zweiter Ordnung ��[Straßenbau]
elevated guide way aufgeständerte Straße ��[Straßenbau]
elevated road aufgeständerte Straße ��[Straßenbau]
elevated way aufgeständerte Straße ��[Straßenbau]
road overpass aufgeständerte Straße ��[Straßenbau]
stilted road aufgeständerte Straße ��[Straßenbau]
vehicle-access road befahrbare Straße
wide road breite Straße
three-lane road dreispurige Straße ��[Straßenbau]
pavement pizza [coll.] Erbrochenes auf der Straße
embanked road erhöhte Straße ��[Straßenbau]
road on embankment erhöhte Straße ��[Straßenbau]
European Agreement concerning the International Carriage of Dangerous Goods by Road [env.]Europäisches Übereinkommen über die internationale Beförderung gefährlicher Güter auf der Straße
flow line die Fließ-Straße
toll road gebührenpflichtige Straße
turnpike gebührenpflichtige Straße
metaledAE road gepflasterte Straße
metalledBE road gepflasterte Straße
paved road gepflasterte Straße
crushed rock road geschotterte Straße
metaledAE road geschotterte Straße
metalledBE road geschotterte Straße
staggered mill gestaffelte Straße
lane kleine Straße
Korea Strait die Korea-Straße
grade-separated highway kreuzungsfreie Straße ��[Straßenbau]
twisting road kurvenreiche Straße ��[Straßenbau]
winding road kurvenreiche Straße ��[Straßenbau]
burying under the road [tech.] Leitungsverlegung in der Straße
layout of a road Linienführung einer Straße ��[Straßenbau]
lie of a road Linienführung einer Straße ��[Straßenbau]
multi-lane road mehrspurige Straße ��[Straßenbau]
public road öffentliche Straße
off-street parking Parken abseits der Straße
stop-and-search operation Polizeikontrolle auf der Straße
road testing Prüfung auf der Straße
slippery road surface rutschige Straße
steep road steile Straße
covered urban street überbaute Straße
covered arcade überdachte Straße
dirt road unbefestigte Straße
dirt track unbefestigte Straße
earth road (Brit.) unbefestigte Straße
gravel road unbefestigte Straße ��[Straßenbau]
underground thoroughfare [constr.] unterirdische Straße ��[Straßenbau]
multi-lane road vielspurige Straße ��[Straßenbau]
four-lane highway vierspurige Straße
two-lane road zweispurige Straße ��[Straßenbau]
Verben (11 of 11) Verben (11 of 11)
to scarify a road surface [constr.] eine Straße aufreißen ��[Straßenbau]
to repair the road die Straße ausbessern
to go along the street die Straße entlang gehen
to cross the road die Straße überqueren
to turn off a road eine Straße verlassen
to litter the street Abfälle auf die Straße werfen
to end up on the street [fig.] auf der Straße landen
to live rough (Brit.) auf der Straße leben
to sleep rough (Brit.) auf der Straße leben
to turn adrift auf die Straße setzen
to live on a street in einer Straße wohnen
Adjektive/Adverbien (6 of 6) Adjektive/Adverbien (6 of 6)
along the road die Straße entlang
at the road's end am Ende der Straße
in the street auf der Straße
on the road auf der Straße
on the street (Amer.) auf der Straße
free on road frei bis Straße
Definitionen (4 of 4) Definitionen (4 of 4)
jaywalking bei Rot über die Straße gehen
Whitehall (Brit.) Straße in London zwischen Trafalgar Square und Houses of Parliament, d. h. im brit. Regierungsviertel
to jaywalk unachtsam eine Straße überqueren
jaywalking unachtsames Überqueren einer Straße
Wendungen/Ausdrücke (2 of 2) Wendungen/Ausdrücke (2 of 2)
Road closed! Straße gesperrt!
The streets are paved with gold. Das Geld liegt auf der Straße.
Beispiele (5 of 5) Beispiele (5 of 5)
the road is under repair die Straße wird eben ausgebessert
It's a busy street. Es ist eine verkehrsreiche Straße.
the man in the street der Mann auf der Straße
Where does this road go to? Wohin führt diese Straße?
on highways auf Straßen außerhalb von Ortschaften
*) ein Service von canoo.net
Updated: fixed spelling mistakes
Re: WWW::Dict::Leo::Org encoding issue by Krambambuli (Deacon) on Jun 14, 2010 at 15:20 UTC
Just because no one has mentioned this so far.
Looking into your output.txt with a hex viewer should allow for a first important divide:
is the file containing what you want or need, as you want or need?
Once this is clarified, you'll have a handle to track down the issue: towards Perl, towards Notepad++/Windows, or even in both directions.
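If no hex viewer is at hand, a few lines of Perl can do a rough version of the same check. This is only a sketch: first_bytes_hex is a helper name made up for this example, and 16 bytes is an arbitrary choice.

```perl
use strict;
use warnings;

# Return the first $count bytes of a file as space-separated hex pairs.
# first_bytes_hex is a helper made up for this example.
sub first_bytes_hex {
    my ($qfn, $count) = @_;
    open(my $fh, '<:raw', $qfn) or die "$qfn: $!\n";
    my $got = read($fh, my $buf, $count);
    defined $got or die "$qfn: $!\n";
    close($fh);
    return join(' ', map { sprintf('%02X', ord($_)) } split(//, $buf));
}

# Only look at output.txt if the script has already produced it.
print first_bytes_hex('output.txt', 16), "\n" if -e 'output.txt';
```

A UTF-8 file that starts with a BOM begins with EF BB BF. Further into the file, a UTF-8 encoded ä shows up as C3 A4, while a cp1252 encoded ä is the single byte E4.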
is the file containing what you want or need, as you want or need?
Apparently so, since everything appears correctly in the editor once he switches it to the right mode.
The question he's asking is how to convince his editor to automatically switch to the right mode (UTF-8 encoded instead of "ANSI" encoded).
Re: WWW::Dict::Leo::Org encoding issue by Yary (Scribe) on Jun 14, 2010 at 17:44 UTC
print OUT "\xEF\xBB\xBF";
open(my $fh, '>:utf8', $qfn) or die;
print $fh "\x{FEFF}";
print $fh $text;
is simpler than the equivalent
open(my $fh, '>:bytes', $qfn) or die;
print $fh "\xEF\xBB\xBF";
binmode($fh, ':utf8');
print $fh $text;
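To see that the two variants really write the same bytes, both can be run against in-memory filehandles and compared. This is just a sketch; the sample text and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $text = "Stra\x{DF}e\n";   # sample text, made up for this example

# Variant 1: let the :utf8 layer encode the BOM character U+FEFF.
my $buf1 = '';
open(my $fh1, '>:utf8', \$buf1) or die "in-memory open failed: $!\n";
print $fh1 "\x{FEFF}";
print $fh1 $text;
close($fh1);

# Variant 2: write the three BOM bytes by hand, then switch to :utf8.
my $buf2 = '';
open(my $fh2, '>:bytes', \$buf2) or die "in-memory open failed: $!\n";
print $fh2 "\xEF\xBB\xBF";
binmode($fh2, ':utf8');
print $fh2 $text;
close($fh2);

print $buf1 eq $buf2 ? "identical\n" : "different\n";
```

The first variant keeps the BOM at the character level, so nothing has to be remembered about which bytes U+FEFF turns into.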