PerlMonks  

Encoding/decoding question

by slugger415 (Scribe)
on Sep 11, 2011 at 15:33 UTC
slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am processing some XHTML pages (using XML::Twig) that contain numerous character entities, such as:

&Atilde;&copy;

When I parse these files using XML::Twig, they turn into all sorts of wonky characters that look nothing like they did in the original HTML.

r&Atilde;&copy;serve
becomes
rÃ©serve

I've tried setting keep_encoding in Twig, and the entities get preserved, but I get another set of wonky characters when that output goes to HTML.

I'm not sure how to proceed here -- any thoughts? I'm sure there's some kind of encoding/decoding process I need to do here, but I'm unfamiliar with the process.

Many thanks.

Scott

Re: Encoding/decoding question
by ikegami (Pope) on Sep 11, 2011 at 18:23 UTC

    The original XHTML is broken.

    • &Atilde; or &#xC3; is U+00C3 LATIN CAPITAL LETTER A WITH TILDE, Ã.
    • &copy; or &#xA9; is U+00A9 COPYRIGHT SIGN, ©.

    XML::Twig is actually returning the correct data.

    r&Atilde;&copy;serve
    should be
    r&eacute;serve
    • &eacute; or &#xE9; is U+00E9 LATIN SMALL LETTER E WITH ACUTE, é.
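To make the point above concrete, here is a minimal sketch of what a correct parser does with the broken markup, plus one possible repair. The repair step (re-reading the two code points as the bytes C3 A9 and decoding those as UTF-8) is an assumption about how the damage happened, not something XML::Twig does for you, and the helper name code_points is made up for this example.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use HTML::Entities qw(decode_entities);

# Hypothetical helper: list the code points of a character string.
sub code_points {
    my ($str) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $str;
}

# What the broken XHTML contains (pure ASCII).
my $broken = 'r&Atilde;&copy;serve';

# A correct parser resolves the entities to U+00C3 and U+00A9.
my $parsed = decode_entities($broken);

# Assumed repair: treat those code points as the bytes C3 A9,
# then decode the bytes as UTF-8, which yields U+00E9.
my $fixed = decode('UTF-8', encode('ISO-8859-1', $parsed));

print 'parsed: ', code_points($parsed), "\n";
print 'fixed:  ', code_points($fixed),  "\n";
```

This only works if every character in the parsed text is below U+0100 and the resulting bytes really are valid UTF-8, so treat it as a salvage step rather than a general fix.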

    It appears that the XHTML was produced using

    encode_entities(encode("UTF-8", "réserve"))

    when one should use

    encode("UTF-8", encode_entities("réserve"))
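The difference between the two orderings can be seen in a short sketch like this one, assuming the Encode and HTML::Entities modules. The variable names are only for illustration, and whether the original producer really ran the calls in the wrong order is a guess based on the symptoms.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(encode_entities);

# Character string: r, U+00E9, s, e, r, v, e
my $text = "r\x{E9}serve";

# Suspected wrong order: encode to UTF-8 bytes first, then escape.
# encode_entities sees the bytes C3 and A9 and treats each one as a
# character, so it emits one entity per byte.
my $wrong = encode_entities(encode('UTF-8', $text));

# Right order: escape the characters first, then encode the result.
my $right = encode('UTF-8', encode_entities($text));

print "wrong: $wrong\n";
print "right: $right\n";
```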
      aha! interesting -- I actually used tidy.exe to convert the HTML to XHTML, so tidy must be the culprit. Thanks for the tip!

      (know any better way to turn HTML into XHTML? maybe I should just be using a Perl HTML parser...)

        I doubt that. I suspect the HTML was buggy too.

        Could you show the HTML's HEAD element and the od -c output for réserve?

        ( Update: hum, .exe? You might not have od. Alternative: perl -nE"say unpack 'H*', $_ if /serv/;" file.html )
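If the one-liner above is awkward to run, the same check can be sketched as a small script. The helper classify_reserve is hypothetical and only knows about the word in question. It expects raw bytes (for example, a line read from a filehandle opened with the :raw layer).

```perl
use strict;
use warnings;

# Hypothetical helper: guess how the accented letter in "reserve"
# is stored, given a string of raw bytes.
sub classify_reserve {
    my ($bytes) = @_;
    return 'UTF-8 bytes (c3 a9)'     if $bytes =~ /r\xC3\xA9serve/;
    return 'single-byte charset (e9)' if $bytes =~ /r\xE9serve/;
    return 'character entities'       if $bytes =~ /r(?:&[#\w]+;)+serve/;
    return 'unknown';
}

# Same hex view as the one-liner, applied to sample bytes.
my $sample = "r\xC3\xA9serve";
print unpack('H*', $sample), "\n";
print classify_reserve($sample), "\n";
```

Comparing the result with the charset declared in the HEAD element shows whether the HTML is lying about its encoding.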

        By the way, XML::LibXML has functions for parsing HTML.

        You can use HTML::TreeBuilder to parse the HTML, then output it in XHTML, using the as_XML method, which works most of the time. It may not help with the encoding problem though, especially if the HTML lies about its encoding. XML::Twig can do this for you BTW, so in fact you may not need to use tidy at all, just install HTML::TreeBuilder and then use parsefile_html to parse the HTML.

        Also HTML::Tidy uses a fork of tidy, and may be worth a try.
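For the HTML::TreeBuilder route mentioned above, a minimal sketch might look like this. The sample markup is invented for the example, and as noted, this does not fix anything if the input's declared encoding is wrong.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Invented sample: not well-formed XML (unclosed <br>).
my $html = '<html><head><title>t</title></head>'
         . '<body><p>r&eacute;serve<br></p></body></html>';

# Parse the HTML, then serialise the tree as XML.
my $tree = HTML::TreeBuilder->new_from_content($html);
my $xml  = $tree->as_XML;

# Free the tree explicitly; older releases need this to avoid leaks.
$tree->delete;

print $xml;
```

The output can then be handed to XML::Twig, or parsefile_html can be used on the file directly, which is the shortcut described above.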

Re: Encoding/decoding question
by Anonymous Monk on Sep 11, 2011 at 15:40 UTC
Re: Encoding/decoding question
by slugger415 (Scribe) on Sep 13, 2011 at 14:07 UTC
    Thank you one and all for your comments and suggestions. I still don't understand what's happening with the encoding, but I do have some good options to move forward. Many thanks.
