PerlMonks  

WWW::Dict::Leo::Org encoding issue

by fanticla (Scribe)
on Jun 13, 2010 at 17:40 UTC (#844458)
fanticla has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

a rather silly problem: I'm using the WWW::Dict::Leo::Org module (it fetches and parses the content of an online dictionary, HTML in utf8), but I am experiencing the following encoding problem: I cannot properly show characters such as äüö.

The script is quite simple:

    use strict;
    use warnings;
    use WWW::Dict::Leo::Org;
    use Data::Dumper;

    my $leo = new WWW::Dict::Leo::Org();
    my @matches = $leo->translate("test");

    open (OUT, "output.txt");
    binmode(OUT, ":utf8");
    print OUT Dumper(\@matches);
    close OUT;

If I open output.txt, for example with Notepad++, I see that the encoding is right (utf8), but it fails to properly show characters such as äüö.

If I do not explicitly set the utf8 layer (the HTML site is utf8 encoded) and I open output.txt, I get an ANSI-encoded file in which äüö are not displayed correctly. If I change the encoding in Notepad++ from ANSI to utf8, all characters are displayed right!

Does anyone have a suggestion as to what I am doing wrong? Thanks, Cla

Re: WWW::Dict::Leo::Org encoding issue
by Corion (Pope) on Jun 13, 2010 at 17:43 UTC

    Are you sure that your tool, Notepad++, understands when a file is encoded as UTF8 when opening it?

Re: WWW::Dict::Leo::Org encoding issue
by ikegami (Pope) on Jun 13, 2010 at 18:49 UTC

    If I change the encoding in Notepad++ from ANSI to utf8, all characters are displayed right!

    I'm confused. So it's working fine? Or do you want to output "ANSI" (which is probably really cp1252)?

      notepad++ normally recognizes the encoding correctly.

      I want the output to be in utf-8. I read it at a later stage to display the text in a Text widget.

      The reading works so:

          open (IN, "<:utf8", "output.txt");
          my $in = <IN>;
          while ($in) {
              # doing some formatting
              $in =~ s/\'//g;
              $in =~ s/\=//g;
              $in =~ s/\>//g;
              $in =~ s/(.*)(left)(.*)/$1$2$3/g;
              $text->insert('end', "$3");
              $in = <IN>;
          }
          close IN;

      Of course the Text widget doesn't show the äöü correctly.

      I am not an expert of encodings, but I could cope with all other encoding issues so far...

        Are you sure that your (Tk?) widget understands UTF8?

        Perhaps it only switches to UTF-8 mode automatically when the document starts with a BOM. Try adding "\x{FEFF}" to the start of your document.
Re: WWW::Dict::Leo::Org encoding issue
by graff (Chancellor) on Jun 14, 2010 at 03:09 UTC
    your code:
        open (OUT, "output.txt");
        binmode(OUT, ":utf8");
        print OUT Dumper(\@matches);
        close OUT;
    looks like you are opening OUT for read access, then trying to write to it. And you're not checking for any errors, so when something goes wrong, you don't hear about it.

    So, your script is not changing the contents of the file. Try opening for write access -- the nicest way would be:

    open( OUT, ">:utf8", "output.txt" ) or die "output.txt: $!\n";
    BTW, I think Data::Dumper will make sure to convert unicode characters to their "\x{h*}" form, rather than printing actual utf8-encoded byte strings.
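
The Data::Dumper remark can be checked with a small sketch. The sample string below is hypothetical (it is not output from WWW::Dict::Leo::Org), and it assumes the data is a decoded character string; it deliberately contains a euro sign, which lies outside Latin-1:

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample data, not taken from the dictionary module.
# "\x{20AC}" (euro sign) is a wide character, so Perl stores the
# whole string as characters rather than as bytes.
my @sample = ("Stra\x{DF}e kostet 5 \x{20AC}");

my $dumped = Dumper(\@sample);

# The dump contains an escape sequence such as \x{20ac} instead of
# the UTF-8 encoded bytes of the character itself.
print $dumped;
```

If the module hands back undecoded bytes instead, Data::Dumper has nothing to escape this way and the result may differ, so it is worth checking what the dump actually contains.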

    ikegami's point about printing a BOM character first is simply that many tools (including Notepad, Wordpad and other M$ utils) rely on a file-initial BOM as a sort of "magic word" that tells the tool how it should interpret the file contents. So, after the kind of open statement shown above, I would do:

    print OUT "\x{feff}\n"; # and then print all the utf8 text content...
Re: WWW::Dict::Leo::Org encoding issue
by wwe (Friar) on Jun 14, 2010 at 10:50 UTC
    I played a bit with your code. After some changes I get the right encoding and the right special characters like äöü. The only problem is some garbage where text is formatted on the website, e.g.
    to scarify a road surface [constr.] eine Straße aufreißen &#65533;&#65533;[Straßenbau]
    I'm using Notepad2 for checking the file. See the code here:
        use strict;
        use warnings;
        use Data::Dumper;
        use v5.10;
        use Encode qw(encode decode encode_utf8);
        use WWW::Dict::Leo::Org;

        my $leo = WWW::Dict::Leo::Org->new( -Debug => 0 );
        open( my $fh, ">:utf8", "leo-translate.txt" );
        my $string = 'strasse';
        #$string = encode_utf8 ($string);
        #$string = decode('utf8', $string);
        #$string = encode('utf8', $string);
        foreach my $match ( $leo->translate($string) ) {
            say {$fh} $match->{'title'};
            foreach my $value ( @{ $match->{'data'} } ) {
                my $string1 = decode('utf8', $value->{'left'} );
                my $string2 = decode('utf8', $value->{'right'} );
                my $string = join("\t", $value->{'left'}, $value->{'right'} );
                printf {$fh} ("%-50s%-50s\n", $string1, $string2);
            }
        }
    The file contains:
        Substantive (8 of 8)                  Substantive (8 of 8)
        avenue                                die Straße
        forest road                           die Straße
        highway                               die Straße
        road                                  die Straße
        route                                 die Straße
        strait [geog.]                        die Straße &#65533;&#65533;- Meer
        street                                die Straße
        way                                   die Straße
        (keine)
        Substantiv: Straß - Flexionstabelle: Straß *)Substantiv: Strass - Flexionstabelle: Strass *)
        *) ein Service von canoo.net
        Substantive (68 of 68)                Substantive (68 of 68)
        sunken road                           Straße in Tieflage
        road tunnel [constr.]                 Straße in Tunnellage &#65533;&#65533;[Straßenbau]
        undivided road                        Straße mit einer Fährbahn
        single carriageway road               Straße mit einer Fahrbahn &#65533;&#65533;[Straßenbau]
        undivided two-way road                Straße mit einer Fahrbahn
        two-way road                          Straße mit Gegenverkehr &#65533;&#65533;[Straßenbau]
        divided highway                       Straße mit getrennter Fahrbahn &#65533;&#65533;[Straßenbau]
        divided road                          Straße mit getrennter Fahrbahn &#65533;&#65533;[Straßenbau]
        clearway                              Straße mit Halteverbot (auch: Haltverbot)
        cobbled street                        Straße mit Kopfsteinpflaster
        dual carriageway                      Straße mit Mittelstreifen
        tar concrete road                     Straße mit Teerbeton
        odd-lane highway                      Straße mit ungerader Anzahl von Fahrbahnen
        odd-lane road                         Straße mit ungerader Anzahl von Fahrbahnen
        Strait of Dover [geog.]               Straße von Dover
        Strait of Gibraltar [geog.]           Straße von Gibraltar
        Korea Strait                          Straße von Korea
        B-road                                Straße zweiter Ordnung &#65533;&#65533;[Straßenbau]
        minor road                            Straße zweiter Ordnung &#65533;&#65533;[Straßenbau]
        non-principal road                    Straße zweiter Ordnung &#65533;&#65533;[Straßenbau]
        secondary road                        Straße zweiter Ordnung &#65533;&#65533;[Straßenbau]
        elevated guide way                    aufgeständerte Straße &#65533;&#65533;[Straßenbau]
        elevated road                         aufgeständerte Straße &#65533;&#65533;[Straßenbau]
        elevated way                          aufgeständerte Straße &#65533;&#65533;[Straßenbau]
        road overpass                         aufgeständerte Straße &#65533;&#65533;[Straßenbau]
        stilted road                          aufgeständerte Straße &#65533;&#65533;[Straßenbau]
        vehicle-access road                   befahrbare Straße
        wide road                             breite Straße
        three-lane road                       dreispurige Straße &#65533;&#65533;[Straßenbau]
        pavement pizza [coll.]                Erbrochenes auf der Straße
        embanked road                         erhöhte Straße &#65533;&#65533;[Straßenbau]
        road on embankment                    erhöhte Straße &#65533;&#65533;[Straßenbau]
        European Agreement concerning the International Carriage of Dangerous Goods by Road [env.]  Europäisches Übereinkommen über die internationale Beförderung gefährlicher Güter auf der Straße
        flow line                             die Fließ-Straße
        toll road                             gebührenpflichtige Straße
        turnpike                              gebührenpflichtige Straße
        metaledAE road                        gepflasterte Straße
        metalledBE road                       gepflasterte Straße
        paved road                            gepflasterte Straße
        crushed rock road                     geschotterte Straße
        metaledAE road                        geschotterte Straße
        metalledBE road                       geschotterte Straße
        staggered mill                        gestaffelte Straße
        lane                                  kleine Straße
        Korea Strait                          die Korea-Straße
        grade-separated highway               kreuzungsfreie Straße &#65533;&#65533;[Straßenbau]
        twisting road                         kurvenreiche Straße &#65533;&#65533;[Straßenbau]
        winding road                          kurvenreiche Straße &#65533;&#65533;[Straßenbau]
        burying under the road [tech.]        Leitungsverlegung in der Straße
        layout of a road                      Linienführung einer Straße &#65533;&#65533;[Straßenbau]
        lie of a road                         Linienführung einer Straße &#65533;&#65533;[Straßenbau]
        multi-lane road                       mehrspurige Straße &#65533;&#65533;[Straßenbau]
        public road                           öffentliche Straße
        off-street parking                    Parken abseits der Straße
        stop-and-search operation             Polizeikontrolle auf der Straße
        road testing                          Prüfung auf der Straße
        slippery road surface                 rutschige Straße
        steep road                            steile Straße
        covered urban street                  überbaute Straße
        covered arcade                        überdachte Straße
        dirt road                             unbefestigte Straße
        dirt track                            unbefestigte Straße
        earth road (Brit.)                    unbefestigte Straße
        gravel road                           unbefestigte Straße &#65533;&#65533;[Straßenbau]
        underground thoroughfare [constr.]    unterirdische Straße &#65533;&#65533;[Straßenbau]
        multi-lane road                       vielspurige Straße &#65533;&#65533;[Straßenbau]
        four-lane highway                     vierspurige Straße
        two-lane road                         zweispurige Straße &#65533;&#65533;[Straßenbau]
        Verben (11 of 11)                     Verben (11 of 11)
        to scarify a road surface [constr.]   eine Straße aufreißen &#65533;&#65533;[Straßenbau]
        to repair the road                    die Straße ausbessern
        to go along the street                die Straße entlang gehen
        to cross the road                     die Straße überqueren
        to turn off a road                    eine Straße verlassen
        to litter the street                  Abfälle auf die Straße werfen
        to end up on the street [fig.]        auf der Straße landen
        to live rough (Brit.)                 auf der Straße leben
        to sleep rough (Brit.)                auf der Straße leben
        to turn adrift                        auf die Straße setzen
        to live on a street                   in einer Straße wohnen
        Adjektive/Adverbien (6 of 6)          Adjektive/Adverbien (6 of 6)
        along the road                        die Straße entlang
        at the road's end                     am Ende der Straße
        in the street                         auf der Straße
        on the road                           auf der Straße
        on the street (Amer.)                 auf der Straße
        free on road                          frei bis Straße
        Definitionen (4 of 4)                 Definitionen (4 of 4)
        jaywalking                            bei Rot über die Straße gehen
        Whitehall (Brit.)                     Straße in London zwischen Trafalgar Square und Houses of Parliament, d. h. im brit. Regierungsviertel
        to jaywalk                            unachtsam eine Straße überqueren
        jaywalking                            unachtsames Überqueren einer Straße
        Wendungen/Ausdrücke (2 of 2)          Wendungen/Ausdrücke (2 of 2)
        Road closed!                          Straße gesperrt!
        The streets are paved with gold.      Das Geld liegt auf der Straße.
        Beispiele (5 of 5)                    Beispiele (5 of 5)
        the road is under repair              die Straße wird eben ausgebessert
        It's a busy street.                   Es ist eine verkehrsreiche Straße.
        the man in the street                 der Mann auf der Straße
        Where does this road go to?           Wohin führt diese Straße?
        on highways                           auf Straßen außerhalb von Ortschaften
        *) ein Service von canoo.net
    Updated: fixed spelling mistakes
Re: WWW::Dict::Leo::Org encoding issue
by Krambambuli (Deacon) on Jun 14, 2010 at 15:20 UTC
    Just because no one mentioned this so far.

    Looking into your output.txt with a hex viewer should allow a first important distinction:

    is the file containing what you want or need, as you want or need ?

    Once this is clarified, you'll have a handle for tracking down the issue: in Perl, in Notepad++/Windows, or in both.
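
If no hex viewer is at hand, a few lines of Perl can show the raw bytes. This is only a sketch; `hex_head` is a hypothetical helper name, and the file name is the one from the original question:

```perl
use strict;
use warnings;

# Return the first $count bytes of a file as space-separated hex pairs.
sub hex_head {
    my ($path, $count) = @_;
    open(my $fh, '<:raw', $path) or die "$path: $!\n";
    my $n = read($fh, my $buf, $count);
    defined $n or die "$path: read failed: $!\n";
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

# A UTF-8 file that starts with a BOM begins with EF BB BF.
# "ß" encoded as UTF-8 shows up as C3 9F; in cp1252 it is a single DF.
print hex_head('output.txt', 32), "\n" if -e 'output.txt';
```

Seeing C3 9F where an ß is expected means the file itself is UTF-8 and the remaining problem lies with the viewer; seeing C3 83 C2 9F instead would point to the text having been encoded twice.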

      is the file containing what you want or need, as you want or need ?

      Apparently so, since everything appears correctly in the editor once he switches it to the right mode.

      The question he's asking is how to convince his editor to automatically switch to the right mode (UTF-8 encoded instead of "ANSI" encoded).

Re: WWW::Dict::Leo::Org encoding issue
by Yary (Scribe) on Jun 14, 2010 at 17:44 UTC
    I think the only way an editor could tell if a file was utf-8 and not ASCII "easily" was if it began with a byte order mark. Not needed or even recommended for utf-8 since it doesn't have a byte ordering, but editors on Windows like to see it. See the wikipedia entry.
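
Without a BOM, a tool has to guess. One possible heuristic is sketched below: accept a BOM if present, otherwise attempt a strict UTF-8 decode. This is an assumption for illustration, not a description of what Notepad++ or any other editor actually does, and `guess_utf8` is a hypothetical name:

```perl
use strict;
use warnings;
use Encode ();

# Guess whether a string of raw bytes is UTF-8.
# A leading BOM settles it; otherwise try a strict decode.
sub guess_utf8 {
    my ($bytes) = @_;
    return 1 if $bytes =~ /\A\xEF\xBB\xBF/;
    my $copy = $bytes;    # decode() with a CHECK value may modify its argument
    my $ok = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 };
    return $ok ? 1 : 0;
}

print guess_utf8("Stra\xC3\x9Fe") ? "looks like UTF-8\n" : "not valid UTF-8\n";
print guess_utf8("Stra\xDFe")     ? "looks like UTF-8\n" : "not valid UTF-8\n";
```

Note that pure ASCII text passes this check as well, and short cp1252 texts can occasionally pass by accident, which is why the BOM remains the unambiguous signal.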

    Try adding this line after "binmode":

    print OUT "\xEF\xBB\xBF";
      Previously mentioned

          open(my $fh, '>:utf8', $qfn) or die;
          print $fh "\x{FEFF}";
          print $fh $text;

      is simpler than equivalent

          open(my $fh, '>:bytes', $qfn) or die;
          print $fh "\xEF\xBB\xBF";
          binmode($fh, ':utf8');
          print $fh $text;
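
The equivalence of the two variants can be checked without touching the disk by writing to in-memory filehandles. This is a minimal sketch; the sample text and variable names are made up for the test:

```perl
use strict;
use warnings;

my $text = "Stra\x{DF}e\n";    # sample character string

# Variant 1: let the :utf8 layer encode the BOM character U+FEFF.
my $buf1 = '';
open(my $fh1, '>:utf8', \$buf1) or die "open: $!";
print {$fh1} "\x{FEFF}";
print {$fh1} $text;
close $fh1;

# Variant 2: write the three BOM bytes as-is, then switch to :utf8.
my $buf2 = '';
open(my $fh2, '>:bytes', \$buf2) or die "open: $!";
print {$fh2} "\xEF\xBB\xBF";
binmode($fh2, ':utf8');
print {$fh2} $text;
close $fh2;

print $buf1 eq $buf2 ? "identical\n" : "different\n";
```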
