http://www.perlmonks.org?node_id=938467


in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course, this only really works when the encoding is a superset of ASCII (such as ISO-8859-* or UTF-8), because the parser has to be able to read the declaration before it knows the encoding.
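To illustrate the ASCII-superset point, here is a minimal sketch using only the core Encode module. The helper name sniff_meta_charset and the regex are my own assumptions for illustration; this is not how XML::LibXML (or libxml2) detects the encoding internally.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: look for a charset declaration in the raw bytes.
# It can only succeed if the declaration itself is readable as plain ASCII.
sub sniff_meta_charset {
    my ($bytes) = @_;
    return $bytes =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i ? lc $1 : undef;
}

# The same small document, once in an ASCII superset and once in UTF-16LE.
my $latin1 = encode('iso-8859-1',
    qq{<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" /><p>caf\x{e9}</p>});
my $utf16 = encode('UTF-16LE',
    qq{<meta http-equiv="content-type" content="text/html; charset=utf-16le" /><p>caf\x{e9}</p>});

# ISO-8859-1: the declaration is found, so the bytes can be decoded correctly.
my $found = sniff_meta_charset($latin1);
my $text  = decode($found, $latin1);

# UTF-16LE: every ASCII character is followed by a NUL byte, so the
# byte-level search never sees the string "<meta" and finds nothing.
my $missed = sniff_meta_charset($utf16);
```

With the ISO-8859-1 bytes the declared charset is recovered and the text decodes cleanly; with the UTF-16LE bytes nothing is found, even though the document does carry a declaration.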

Re^4: HTML parsing module handles known and unknown encoding
by ikegami (Patriarch) on Nov 16, 2011 at 22:19 UTC

    What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

    1. Parse the HTML doc using another parser that can accept an encoding.
    2. If the document does not indicate its own encoding,
      1. Add a META element if none exists.
      2. Serialise the HTML.
      3. Replace the original HTML with this new HTML.
    3. Parse the HTML doc using XML::LibXML.

      I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
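
For documents that lack a declaration, the workaround in the numbered steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the encoding is known from outside the document (for example, an HTTP Content-Type header), plain string operations from the core Encode module stand in for the "other parser" of step 1, and declare_charset is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: $bytes is the raw document, $known_encoding is
# supplied by the caller (it is NOT detected here).
sub declare_charset {
    my ($bytes, $known_encoding) = @_;
    my $html = decode($known_encoding, $bytes);

    # Step 2 applies only when the document does not declare its own
    # encoding; otherwise hand back the original bytes untouched.
    return $bytes if $html =~ /<meta\b[^>]*\bcharset\s*=/i;

    # Step 2.1: add a META element, right after <head> if there is one,
    # else at the very front of the document.
    my $meta = '<meta http-equiv="content-type" content="text/html; charset=utf-8" />';
    $html =~ s/(<head\b[^>]*>)/$1$meta/i or $html = $meta . $html;

    # Steps 2.2 and 2.3: serialise as UTF-8 so the bytes match the new
    # declaration, and use the result in place of the original.
    return encode('UTF-8', $html);
}

my $bytes = encode('iso-8859-1',
    "<html><head><title>t</title></head><body>caf\x{e9}</body></html>");
my $fixed = declare_charset($bytes, 'iso-8859-1');

# Step 3 would then pass $fixed to XML::LibXML's HTML parser,
# for example parse_html_string (requires XML::LibXML to be installed).
```

A regex is a crude substitute for a real HTML parser here; it is only meant to show the shape of the steps, not to handle arbitrary markup.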