PerlMonks  

Re: HTML parsing module handles known and unknown encoding

by Corion (Patriarch)
on Nov 16, 2011 at 15:49 UTC


in reply to HTML parsing module handles known and unknown encoding

It seems that XML::LibXML has thought about the problem and solved it by requiring that you always pass it octets. If you have an encoding handy, you're allowed to tell XML::LibXML about it, but it's not necessary.

I'm not sure how well XML::LibXML works with UTF-16LE and/or UTF-16BE and BOMs - you might need to use some regular (byte-)expressions to handle the BOM yourself.
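
In case manual BOM handling does turn out to be necessary, here is a minimal sketch using byte-level regular expressions and the core Encode module. The helper name strip_bom is hypothetical, and the sketch does not try to distinguish UTF-32 BOMs from UTF-16 ones.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: look for a BOM in the leading octets and return
# the encoding name together with the octets minus the BOM.
# Note: a UTF-32LE BOM (FF FE 00 00) would be misread as UTF-16LE here.
sub strip_bom {
    my ($octets) = @_;
    return ('UTF-8',    $octets) if $octets =~ s/\A\xEF\xBB\xBF//;
    return ('UTF-16BE', $octets) if $octets =~ s/\A\xFE\xFF//;
    return ('UTF-16LE', $octets) if $octets =~ s/\A\xFF\xFE//;
    return (undef, $octets);    # no BOM found
}

my $raw = "\xFF\xFE" . "h\x00i\x00";    # "hi" as UTF-16LE, preceded by a BOM
my ($enc, $body) = strip_bom($raw);
my $text = defined $enc ? decode($enc, $body) : $body;
print((defined $enc ? $enc : 'unknown'), ": $text\n");
```

The encoding name returned this way could then be handed to whichever parser is used, instead of relying on the parser to interpret the BOM.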

Replies are listed 'Best First'.
Re^2: HTML parsing module handles known and unknown encoding
by ikegami (Patriarch) on Nov 16, 2011 at 18:55 UTC

    That works fine for XML since XML must specify its encoding within the document (binary format), but not so much with HTML where the encoding is specified outside of the document (text format).

    I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

    XML::LibXML handles UTF-16 just fine.

      I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

      The docs of XML::LibXML::Parser say, under the heading PARSER OPTIONS, that there's a parser option encoding which sets the “character encoding of the input” for HTML.
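
      A minimal sketch of how that option might be used, assuming load_html accepts encoding alongside string input as the documentation describes. The sample markup and the choice of iso-8859-1 are made up for illustration.

```perl
use strict;
use warnings;
use XML::LibXML;

# Latin-1 octets for an HTML document that does not declare its encoding.
my $octets = "<html><body><p>caf\xE9</p></body></html>";

my $doc = XML::LibXML->load_html(
    string   => $octets,
    encoding => 'iso-8859-1',    # known from elsewhere, e.g. the HTTP header
    recover  => 1,               # tolerate sloppy real-world markup
);

# findvalue returns a decoded character string, not octets.
my $text = $doc->findvalue('//p');

binmode STDOUT, ':encoding(UTF-8)';
printf "%s (%d characters)\n", $text, length $text;
```

      The point of the design is that the caller keeps passing octets and only names the encoding; no decoding happens before the parser sees the data.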

        Awesome. Don't know how I missed it.

      I don't see any way of specifying the encoding of an HTML document

      Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

      <meta http-equiv="content-type" content="text/html; charset=utf-8" />

      Of course, this will really only work in cases where the encoding is some superset of ASCII (like ISO-8859-*, UTF-8 etc.), because the parser has to read the <meta> tag before it knows the encoding.
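
      To illustrate why ASCII compatibility matters, here is a rough sketch that looks for such a declaration directly in the raw octets. The helper name sniff_meta_charset and the 1024-octet window are assumptions made for the example; real-world sniffing rules are considerably more involved.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the first octets of an HTML document for a
# charset declaration inside a <meta> tag. The pattern is plain ASCII, so
# it can only match if the document's encoding is ASCII-compatible.
sub sniff_meta_charset {
    my ($octets) = @_;
    my $head = substr($octets, 0, 1024);
    return $head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i
        ? lc $1
        : undef;
}

my $html = qq{<html><head><meta http-equiv="content-type" }
         . qq{content="text/html; charset=utf-8" /></head></html>};
my $charset = sniff_meta_charset($html);
print defined $charset
    ? "declared charset: $charset\n"
    : "no declaration found\n";
```

      The same document encoded as UTF-16 would contain a NUL octet between the ASCII characters, so the pattern would find nothing.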

        What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

        1. Parse the HTML doc using another parser that can accept an encoding.
        2. If the document does not indicate its own encoding,
          1. Add a META element if none exists.
          2. Serialise the HTML.
          3. Replace the original HTML with this new HTML.
        3. Parse the HTML doc using XML::LibXML.
      I thought that you could set the encoding through XML::LibXML::Document->setEncoding, but for that to work, you need to parse the document first, which will likely ruin the encoded characters.
