PerlMonks  

Re^3: HTML parsing module handles known and unknown encoding

by grantm (Parson)
on Nov 16, 2011 at 21:07 UTC ( #938467=note )


in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course, this only works when the encoding is some superset of ASCII (ISO-8859-*, UTF-8, etc.), because the parser has to be able to read the <meta> tag before it knows what the encoding is.
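To illustrate the ASCII-superset caveat, here is a minimal sketch using only core Perl. The helper sniff_meta_charset is hypothetical (it is not part of XML::LibXML, and this is not how libxml2 does its own detection); it simply scans the raw, undecoded bytes for a charset declaration. It finds one in a Latin-1 document but not in the same document encoded as UTF-16LE, where every ASCII character is followed by a NUL byte.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, for illustration only: look for a <meta> charset
# declaration in the raw bytes of an HTML document. This can only succeed
# if the encoding is ASCII-compatible, because the tag has to be readable
# before the encoding is known.
sub sniff_meta_charset {
    my ($bytes) = @_;
    my $head = substr($bytes, 0, 1024);    # only look near the start
    return lc $1
        if $head =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i;
    return;                                # no declaration found
}

my $html = qq{<html><head><meta http-equiv="content-type" }
         . qq{content="text/html; charset=iso-8859-1" /></head>}
         . qq{<body>caf\xE9</body></html>};

# The string above doubles as its own Latin-1 byte sequence.
my $latin1 = $html;

# Same markup as UTF-16LE bytes: the tag is no longer visible as ASCII.
my $utf16 = encode('UTF-16LE', $html);

print sniff_meta_charset($latin1) // 'none', "\n";    # prints "iso-8859-1"
print sniff_meta_charset($utf16)  // 'none', "\n";    # prints "none"
```

A real parser does more than this (byte-order marks, HTTP headers and so on), so treat it as a demonstration of the limitation rather than a detection routine.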


Re^4: HTML parsing module handles known and unknown encoding
by ikegami (Pope) on Nov 16, 2011 at 22:19 UTC

    What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the providers of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

    1. Parse the HTML doc using another parser that can accept an encoding.
    2. If the document does not indicate its own encoding,
      1. Add a META element if none exists.
      2. Serialise the HTML.
      3. Replace the original HTML with this new HTML.
    3. Parse the HTML doc using XML::LibXML.

        I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
