Encoding/decoding question

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am processing some XHTML pages (using XML::Twig) that contain numerous character entities, such as:

&#195;&#169;

When I parse these files using XML::Twig, they turn into all sorts of wonky characters that look nothing like they did in the original HTML.

réserve

becomes

rÃ©serve

I've tried setting keep_encoding in Twig, and the entities get preserved, but I get another set of wonky characters when that output goes to HTML.

I'm not sure how to proceed here -- any thoughts? I'm sure there's some kind of encoding/decoding process I need to do here, but I'm unfamiliar with the process.

Many thanks.

Scott

Re: Encoding/decoding question by ikegami (Patriarch) on Sep 11, 2011 at 18:23 UTC
The original XHTML is broken. `&#195;` or `&#xC3;` is U+00C3 LATIN CAPITAL LETTER A WITH TILDE, Ã. `&#169;` or `&#xA9;` is U+00A9 COPYRIGHT SIGN, ©. XML::Twig is actually returning the correct data. `r&#195;&#169;serve` should be `r&#233;serve`. `&#233;` or `&#xE9;` is U+00E9 LATIN SMALL LETTER E WITH ACUTE, é. It appears that the XHTML was produced using `encode_entities(encode("UTF-8", "réserve"))` when one should use `encode("UTF-8", encode_entities("réserve"))`.
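To make the difference between the two call orders concrete, here is a minimal sketch that uses only the core Encode module. The helper `to_numeric_entities` is a hypothetical stand-in for `encode_entities` from HTML::Entities (it only emits decimal references for non-ASCII characters), and the repair step at the end is an assumption that holds only when every numeric reference in the damaged text is really a UTF-8 byte value.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical stand-in for HTML::Entities::encode_entities: replaces every
# character above U+007F with a decimal numeric character reference.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/'&#' . ord($1) . ';'/ge;
    return $text;
}

my $word = "r\x{E9}serve";    # the word as characters; \x{E9} is e-acute

# Wrong order: encoding to UTF-8 first turns U+00E9 into the bytes C3 A9,
# and each byte is then escaped as if it were a character of its own.
my $broken = to_numeric_entities( encode( "UTF-8", $word ) );

# Right order: escape the character first, then encode the ASCII result.
my $correct = encode( "UTF-8", to_numeric_entities($word) );

print "$broken\n";     # r&#195;&#169;serve
print "$correct\n";    # r&#233;serve

# Repair sketch: turn the references back into bytes, then decode as UTF-8.
( my $bytes = $broken ) =~ s/&#(\d+);/chr($1)/ge;
my $repaired = decode( "UTF-8", $bytes );
printf "U+%04X\n", ord( substr( $repaired, 1, 1 ) );    # U+00E9
```

The first printed line matches the entities in the question, which is why the wrong call order is the likely cause; it does not say which tool in the chain made the mistake.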
Re^2: Encoding/decoding question by slugger415 (Monk) on Sep 11, 2011 at 19:13 UTC
aha! interesting -- I actually used tidy.exe to convert the HTML to XHTML, so tidy must be the culprit. Thanks for the tip! (know any better way to turn HTML into XHTML? maybe I should just be using a Perl HTML parser...)
Re^3: Encoding/decoding question by ikegami (Patriarch) on Sep 11, 2011 at 20:26 UTC
I doubt that. I suspect the HTML was buggy too. Could you show the HTML's HEAD element and the `od -c` output for `réserve`? ( Update: hum, .exe? You might not have `od`. Alternative: `perl -nE"say unpack 'H*', $_ if /serv/;" file.html` ) By the way, XML::LibXML has functions for parsing HTML.
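As a guide to reading that hex dump, the following sketch prints what the word would look like on disk under two common encodings. It uses only the core Encode module; the choice of ISO-8859-1 as the second candidate is an assumption for illustration, since the thread does not say what encoding the HTML actually uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $word = "r\x{E9}serve";    # the word as characters; \x{E9} is e-acute

my %hex;
for my $encoding ( "UTF-8", "ISO-8859-1" ) {
    # Same unpack 'H*' idea as the one-liner: show the raw bytes as hex.
    $hex{$encoding} = unpack "H*", encode( $encoding, $word );
    printf "%-10s %s\n", $encoding, $hex{$encoding};
}
# UTF-8      72c3a97365727665   (e-acute stored as the two bytes c3 a9)
# ISO-8859-1 72e97365727665     (e-acute stored as the single byte e9)
```

If the dump shows `c3a9` while the HEAD element declares a single-byte charset, or `e9` while it declares UTF-8, the HTML is misreporting its own encoding.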
Re^4: Encoding/decoding question by tchrist (Pilgrim) on Sep 12, 2011 at 00:43 UTC
Re^5: Encoding/decoding question by ikegami (Patriarch) on Sep 12, 2011 at 02:35 UTC
Re^5: Encoding/decoding question by Anonymous Monk on Sep 12, 2011 at 20:34 UTC
Re^4: Encoding/decoding question by slugger415 (Monk) on Sep 12, 2011 at 15:11 UTC
Re^5: Encoding/decoding question by tchrist (Pilgrim) on Sep 12, 2011 at 15:51 UTC
Re^3: Encoding/decoding question by mirod (Canon) on Sep 13, 2011 at 08:36 UTC
You can use HTML::TreeBuilder to parse the HTML, then output it in XHTML, using the `as_XML` method, which works most of the time. It may not help with the encoding problem though, especially if the HTML lies about its encoding. XML::Twig can do this for you BTW, so in fact you may not need to use `tidy` at all, just install HTML::TreeBuilder and then use `parsefile_html` to parse the HTML. Also HTML::Tidy uses a fork of tidy, and may be worth a try.
Re: Encoding/decoding question by Anonymous Monk on Sep 11, 2011 at 15:40 UTC
"I'm not sure how to proceed here -- any thoughts?" Two :) (1) Super Search and check the examples (like XML-Twig-3.38/t/test_safe_encode.t and t/test_autoencoding_conversion.t). (2) Provide a test case.
Re: Encoding/decoding question by slugger415 (Monk) on Sep 13, 2011 at 14:07 UTC
Thank you one and all for your comments and suggestions. I still don't understand what's happening with the encoding, but I do have some good options to move forward. Many thanks.
