Re^3: HTML parsing module handles known and unknown encoding

in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course this will really only work in cases where the encoding is some superset of ASCII (like ISO-8859-*, UTF-8, etc.), because the parser has to read the <meta> tag itself before it knows which encoding to use.
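To illustrate why the declaration only helps for ASCII-compatible encodings, here is a minimal sketch in plain Perl using only the core Encode module. It is not how XML::LibXML or libxml2 implement detection; the helper names `sniff_meta_charset` and `decode_html`, the 1024-byte window, and the fallback argument are all assumptions made up for this example.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: look for a charset declaration near the start of the
# raw bytes, reading them as if they were ASCII. This can only succeed when
# the real encoding is ASCII-compatible.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);   # assumed window size
    # Matches <meta http-equiv=... content="...; charset=X"> as well as
    # <meta charset="X">.
    if ($head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i) {
        return $1;
    }
    return;   # no declaration found
}

# Hypothetical helper: decode using the declared charset, or a
# caller-supplied fallback when the document does not declare one.
sub decode_html {
    my ($bytes, $fallback) = @_;
    my $name = sniff_meta_charset($bytes) // $fallback;
    my $enc  = defined $name ? find_encoding($name) : undef;
    die "No usable encoding for document\n" unless $enc;
    return decode($enc, $bytes);
}

# A Latin-1 document: the e-acute is the single byte 0xE9.
my $html = qq{<html><head><meta http-equiv="content-type" }
         . qq{content="text/html; charset=iso-8859-1" /></head>}
         . qq{<body>caf\xE9</body></html>};

my $text = decode_html($html, 'utf-8');
print "declared charset: ", sniff_meta_charset($html), "\n";
```

When the document carries no declaration, `sniff_meta_charset` returns nothing and the caller has to supply the encoding from somewhere else, such as the HTTP Content-Type header.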

Re^4: HTML parsing module handles known and unknown encoding by ikegami (Patriarch) on Nov 16, 2011 at 22:19 UTC
What are you saying?

Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents?

Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them?

Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

1. Parse the HTML doc using another parser that can accept an encoding.
2. If the document does not indicate its own encoding, add a META element if none exists.
3. Serialise the HTML.
4. Replace the original HTML with this new HTML.
5. Parse the HTML doc using XML::LibXML.
Re^5: HTML parsing module handles known and unknown encoding by grantm (Parson) on Nov 17, 2011 at 23:42 UTC
I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
