Re^3: HTML parsing module handles known and unknown encoding

by grantm (Parson)
on Nov 16, 2011 at 21:07 UTC

in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, the encoding of an HTML document is normally specified in the HTTP Content-Type header, but you can use the 'http-equiv' attribute on a <meta> tag to embed the equivalent of an HTTP header in the HTML itself. For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course, this only really works when the encoding is some superset of ASCII (like ISO-8859-*, UTF-8, etc.), because the parser has to be able to read the declaration before it knows what the encoding is.
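
To make the ASCII-superset caveat concrete, here is a rough sketch using only core Perl (the Encode module). It is not what XML::LibXML or libxml2 does internally; `sniff_meta_charset` is a hypothetical helper, and the 1024-byte window and the regex are assumptions chosen for illustration. It scans the raw bytes for a charset declaration, then decodes with whatever it found:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: look for a charset declaration in the first bytes
# of an HTML document. Returns the declared name, or undef if none is found.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);    # assumed sniffing window
    # The pattern is matched against raw bytes, so it can only succeed
    # when the declaration is stored as ASCII-compatible bytes.
    return $head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i
        ? $1
        : undef;
}

my $decl = '<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />';
my $html = "<html><head>$decl</head><body>caf\xE9</body></html>";

# ASCII superset: the declaration is readable before decoding.
my $charset = sniff_meta_charset($html);
my $text    = decode($charset, $html);

# Same document stored as UTF-16LE: every ASCII letter is followed by a
# NUL byte, so the byte-level scan no longer sees the declaration.
my $utf16  = encode('UTF-16LE', $text);
my $missed = sniff_meta_charset($utf16);

printf "declared charset: %s\n", $charset;
printf "found in UTF-16 copy: %s\n", defined $missed ? 'yes' : 'no';
```

A real parser does rather more than this (byte-order marks, for instance), so treat it only as an illustration of why the declaration has to be readable before the encoding is known.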

Re^4: HTML parsing module handles known and unknown encoding
on Nov 16, 2011 at 22:19 UTC

    What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the provider of the HTML document to edit it so XML::LibXML can process it? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

    1. Parse the HTML doc using another parser that can accept an encoding.
    2. If the document does not indicate its own encoding,
      1. Add a META element if none exists.
      2. Serialise the HTML.
      3. Replace the original HTML with this new HTML.
    3. Parse the HTML doc using XML::LibXML.

I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.


