http://www.perlmonks.org?node_id=925358

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am processing some XHTML pages (using XML::Twig) that contain numerous character entities, such as:

&#195;&#169;

When I parse these files using XML::Twig, they turn into all sorts of wonky characters that look nothing like they did in the original HTML.

r&#195;&#169;serve
becomes
rÃ©serve

I've tried setting keep_encoding in Twig, and the entities get preserved, but I get another set of wonky characters when that output goes to HTML.

I'm not sure how to proceed here -- any thoughts? I'm sure there's some kind of encoding/decoding process I need to do here, but I'm unfamiliar with the process.

Many thanks.

Scott

Replies are listed 'Best First'.
Re: Encoding/decoding question
by ikegami (Patriarch) on Sep 11, 2011 at 18:23 UTC

    The original XHTML is broken.

    • &#xC3; or &#195; is U+00C3 LATIN CAPITAL LETTER A WITH TILDE, Ã.
    • &#xA9; or &#169; is U+00A9 COPYRIGHT SIGN, ©.

    XML::Twig is actually returning the correct data.

    r&#195;&#169;serve
    should be
    r&#233;serve
    • &#xE9; or &#233; is U+00E9 LATIN SMALL LETTER E WITH ACUTE, é.
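
A correct parser turns the two numeric character references &#195; and &#169; into the characters U+00C3 and U+00A9, which happen to be the two UTF-8 bytes of é (C3 A9). The following is only a sketch of one possible repair, not something proposed in this thread: it assumes every affected character is below U+0100 and that the whole string was double-encoded in the same way. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# What a correct XML parser returns for the broken input:
# the characters U+00C3 and U+00A9, not a single U+00E9.
my $parsed = "r\x{C3}\x{A9}serve";

# Hypothetical repair: map each character back to one byte (Latin-1),
# then decode those bytes as the UTF-8 they originally were.
my $bytes = encode("iso-8859-1", $parsed);
my $fixed = decode("UTF-8", $bytes);

# Prints: 7 characters, second is U+00E9
printf "%d characters, second is U+%04X\n",
    length($fixed), ord(substr($fixed, 1, 1));
```

This only papers over the problem; fixing whatever produced the broken XHTML is the cleaner route.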

    It appears that the XHTML was produced using

    encode_entities(encode("UTF-8", "réserve"))

    when one should use

    encode("UTF-8", encode_entities("réserve"))
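
To make the difference between the two call orders concrete, here is a small sketch using Encode and HTML::Entities. Note that encode_entities emits named entities by default, so the output looks different from the numeric references in the broken XHTML, but the mistake is the same: in the wrong order each UTF-8 byte is escaped as if it were a character.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTML::Entities qw(encode_entities);

# A character string containing U+00E9 (é), written with an escape
# so the example does not depend on the source file's encoding.
my $text = "r\x{E9}serve";

# Wrong order: the text becomes UTF-8 bytes first, then every byte
# above 0x7F is escaped as though it were a character of its own.
my $wrong = encode_entities(encode("UTF-8", $text));

# Right order: escape characters first, then encode the result.
my $right = encode("UTF-8", encode_entities($text));

print "$wrong\n";   # r&Atilde;&copy;serve
print "$right\n";   # r&eacute;serve
```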
      aha! interesting -- I actually used tidy.exe to convert the HTML to XHTML, so tidy must be the culprit. Thanks for the tip!

      (know any better way to turn HTML into XHTML? maybe I should just be using a Perl HTML parser...)

        I doubt that. I suspect the HTML was buggy too.

        Could you show the HTML's HEAD element and the od -c output for réserve?

        ( Update: hum, .exe? You might not have od. Alternative: perl -nE"say unpack 'H*', $_ if /serv/;" file.html )
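
For reading the hex dump that the one-liner prints, it may help to know what the word looks like under each encoding. A minimal sketch using only core modules (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "r\x{E9}serve";   # character string containing U+00E9 (é)

# Hex of the bytes a file would contain under each encoding.
my $utf8_hex   = unpack 'H*', encode("UTF-8", $text);
my $latin1_hex = unpack 'H*', encode("iso-8859-1", $text);

print "UTF-8:   $utf8_hex\n";     # 72c3a97365727665
print "Latin-1: $latin1_hex\n";   # 72e97365727665
```

If the dump shows c3a9 but the HEAD declares ISO-8859-1 (or declares nothing), the file is UTF-8 that is being read as Latin-1, which would explain output like the above.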

        By the way, XML::LibXML has functions for parsing HTML.

        You can use HTML::TreeBuilder to parse the HTML, then output it in XHTML, using the as_XML method, which works most of the time. It may not help with the encoding problem though, especially if the HTML lies about its encoding. XML::Twig can do this for you BTW, so in fact you may not need to use tidy at all, just install HTML::TreeBuilder and then use parsefile_html to parse the HTML.
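
A rough sketch of the HTML::TreeBuilder route, assuming the HTML has already been decoded into a Perl character string (that decoding step is exactly where a file that lies about its encoding will still cause trouble). The sample markup is made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical input: sloppy HTML with an unclosed <p>, already
# decoded to characters (U+00E9 written as an escape).
my $html = "<html><head><title>t</title></head>"
         . "<body><p>r\x{E9}serve</body></html>";

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $xhtml = $tree->as_XML;   # serialise the tree as XML
$tree->delete;               # free the tree explicitly

print $xhtml;
```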

        Also HTML::Tidy uses a fork of tidy, and may be worth a try.

Re: Encoding/decoding question
by Anonymous Monk on Sep 11, 2011 at 15:40 UTC
Re: Encoding/decoding question
by slugger415 (Monk) on Sep 13, 2011 at 14:07 UTC
    Thank you one and all for your comments and suggestions. I still don't understand what's happening with the encoding, but I do have some good options to move forward. Many thanks.