Re^2: HTML parsing module handles known and unknown encoding

That works fine for XML, since XML must specify its encoding within the document (which makes it a binary format), but not so well for HTML, where the encoding is specified outside the document (which makes it a text format).

I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

XML::LibXML handles UTF-16 just fine.

Re^3: HTML parsing module handles known and unknown encoding by ambrus (Abbot) on Nov 17, 2011 at 08:50 UTC
"I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML."
The docs of XML::LibXML::Parser say, under the heading PARSER OPTIONS, that there's a parser option `encoding` which sets the “character encoding of the input” for HTML.
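A minimal sketch of what that option looks like in use, assuming the options hash reference accepted by `parse_html_string` is the right place to pass it. The sample bytes and the choice of ISO-8859-1 are made up for illustration:

```perl
use strict;
use warnings;
use XML::LibXML;

# Raw bytes with no charset declaration; \xE9 is "é" in ISO-8859-1.
# (Illustrative input, not taken from the thread.)
my $bytes = "<html><body><p>caf\xE9</p></body></html>";

my $parser = XML::LibXML->new;

# Supply the encoding known from outside the document
# (for example, from the HTTP Content-Type header).
my $doc = $parser->parse_html_string($bytes, { encoding => 'iso-8859-1' });

# Text pulled from the DOM is a decoded Perl character string.
my $text = $doc->findvalue('//p');

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```

The same options hash should also be accepted by `parse_html_file` and `parse_html_fh`, though the docs warn that `encoding` may misbehave with older libxml2 versions when the file also declares a charset in a META tag.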
Re^4: HTML parsing module handles known and unknown encoding by ikegami (Patriarch) on Nov 17, 2011 at 09:56 UTC
Awesome. Don't know how I missed it.
Re^3: HTML parsing module handles known and unknown encoding by grantm (Parson) on Nov 16, 2011 at 21:07 UTC
"I don't see any way of specifying the encoding of an HTML document"
Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a `<meta>` tag to include arbitrary headers in your HTML. For example: `<meta http-equiv="content-type" content="text/html; charset=utf-8" />`
Of course this will really only work in cases where the encoding is some superset of ASCII (like iso8859-*, utf8 etc).
Re^4: HTML parsing module handles known and unknown encoding by ikegami (Patriarch) on Nov 16, 2011 at 22:19 UTC
What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?
1. Parse the HTML doc using another parser that can accept an encoding.
2. If the document does not indicate its own encoding, add a META element if none exists.
3. Serialise the HTML.
4. Replace the original HTML with this new HTML.
5. Parse the HTML doc using XML::LibXML.
Re^5: HTML parsing module handles known and unknown encoding by grantm (Parson) on Nov 17, 2011 at 23:42 UTC
I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
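For comparison, a rough sketch of the case described here, where the document carries its own `<meta>` charset declaration and no encoding is passed to the parser. The sample bytes are invented for illustration, and the behaviour may depend on the libxml2 version in use:

```perl
use strict;
use warnings;
use XML::LibXML;

# Bytes in ISO-8859-1 (\xE9 is "é"), with the charset declared in a META element.
# (Illustrative input, not taken from the thread.)
my $bytes = '<html><head>'
          . '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">'
          . '</head><body><p>caf' . "\xE9" . '</p></body></html>';

# No encoding option: the parser is left to pick up the META declaration.
my $doc = XML::LibXML->new->parse_html_string($bytes);

# If the declaration was honoured, this is the decoded character string.
my $text = $doc->findvalue('//p');

binmode STDOUT, ':encoding(UTF-8)';
print "$text\n";
```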
Re^3: HTML parsing module handles known and unknown encoding by Corion (Patriarch) on Nov 16, 2011 at 19:07 UTC
I thought that you could set the encoding through XML::LibXML::Document->setEncoding, but for that to work, you need to parse the document first, which will likely ruin the encoded characters.

