
Re^2: HTML parsing module handles known and unknown encoding

by ikegami (Patriarch)
on Nov 16, 2011 at 18:55 UTC


in reply to Re: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

That works fine for XML, since an XML document must specify its encoding within the document itself (binary format), but not so well for HTML, where the encoding is specified outside of the document (text format).

I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

XML::LibXML handles UTF-16 just fine.

Re^3: HTML parsing module handles known and unknown encoding
by ambrus (Abbot) on Nov 17, 2011 at 08:50 UTC
    I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

    The docs of XML::LibXML::Parser say, under the heading PARSER OPTIONS, that there's a parser option encoding which sets the “character encoding of the input” for HTML.
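A minimal sketch of passing that option, assuming the input is a byte string in ISO-8859-1 with no charset declaration of its own. The sample markup and the variable names are made up for illustration; only load_html and the encoding option come from the XML::LibXML docs.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed sample input: raw ISO-8859-1 bytes with no <meta> charset.
# "\xE9" is e-acute in that encoding.
my $bytes = "<html><body><p>caf\xE9</p></body></html>";

# Tell the HTML parser how to interpret the bytes, since the document
# itself does not say.
my $doc = XML::LibXML->load_html(
    string   => $bytes,
    encoding => 'iso-8859-1',
    recover  => 1,              # tolerate sloppy real-world markup
);

# Text pulled out of the DOM is a decoded Perl character string.
my $text = $doc->findvalue('//p');
```

The encoding has to be known before parsing, for example from the HTTP Content-Type header that came with the document.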

      Awesome. Don't know how I missed it.
Re^3: HTML parsing module handles known and unknown encoding
by grantm (Parson) on Nov 16, 2011 at 21:07 UTC
    I don't see any way of specifying the encoding of an HTML document

    Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />

    Of course, this only really works when the encoding is some superset of ASCII (like ISO-8859-*, UTF-8, etc.), since the <meta> tag has to be read before the encoding is known.
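To illustrate why the declaration is only usable with ASCII-compatible encodings, here is a rough sketch using only core modules. sniff_meta_charset is a hypothetical helper written for this example, not part of any library, and the regex is a simplification of what a real HTML parser does.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding);

# Hypothetical helper: scan the first bytes of a raw document for a
# charset declaration. It matches ASCII bytes only, so it can succeed
# only when the real encoding is an ASCII superset.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);
    return $1
        if $head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i;
    return;
}

# Assumed sample: ISO-8859-1 bytes that declare their own charset.
my $html = qq{<html><head><meta http-equiv="content-type" }
         . qq{content="text/html; charset=iso-8859-1" /></head>}
         . qq{<body>caf\xE9</body></html>};

my $charset = sniff_meta_charset($html);

# Decode only if Encode recognises the declared name.
my $text = ( defined $charset && find_encoding($charset) )
         ? decode( $charset, $html )
         : undef;

# The same markup in UTF-16LE has a NUL byte after every ASCII
# character, so the byte-level scan finds no declaration.
my $utf16       = encode( 'UTF-16LE', '<meta charset="utf-16le">' );
my $utf16_guess = sniff_meta_charset($utf16);
```

For the UTF-16 case the encoding has to come from somewhere else, such as a byte order mark or the HTTP headers.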

      What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

      1. Parse the HTML doc using another parser that can accept an encoding.
      2. If the document does not indicate its own encoding,
        1. Add a META element if none exist.
        2. Serialise the HTML.
        3. Replace the original HTML with this new HTML.
      3. Parse the HTML doc using XML::LibXML.
        I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
Re^3: HTML parsing module handles known and unknown encoding
by Corion (Patriarch) on Nov 16, 2011 at 19:07 UTC
    I thought that you could set the encoding through XML::LibXML::Document->setEncoding, but for that to work, you need to parse the document first, which will likely ruin the encoded characters.
