Re^3: HTML parsing module handles known and unknown encoding

by grantm (Parson)
on Nov 16, 2011 at 21:07 UTC (#938467)


in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, the encoding of an HTML document is normally specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include the equivalent of an HTTP header in the HTML itself. For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course, this can only work when the encoding is a superset of ASCII (such as ISO-8859-*, UTF-8, etc.), because the parser has to be able to read the <meta> tag before it knows the encoding.
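
To illustrate why the declaration only helps with ASCII-compatible encodings, here is a minimal sketch in Python. It is not how XML::LibXML (or libxml2) actually detects the charset; the helper name `sniff_meta_charset`, the regular expression, and the 1024-byte window are all assumptions made for this example. It scans the raw bytes for a charset declaration, then shows that the same markup encoded as UTF-16 goes undetected.

```python
import re

# Hypothetical pattern (an assumption for this sketch): find "charset=..."
# inside a <meta> tag while the document is still undecoded bytes.
META_CHARSET = re.compile(
    rb"""<meta[^>]*?charset\s*=\s*["']?\s*([A-Za-z0-9_.:-]+)""",
    re.IGNORECASE,
)

def sniff_meta_charset(raw: bytes, window: int = 1024):
    """Return the declared charset name in lowercase, or None if none is found."""
    match = META_CHARSET.search(raw[:window])
    return match.group(1).decode("ascii").lower() if match else None

html = (
    b'<html><head><meta http-equiv="content-type" '
    b'content="text/html; charset=utf-8" /></head></html>'
)

# ASCII-compatible bytes: the <meta> tag is readable before decoding.
declared = sniff_meta_charset(html)

# Same markup as UTF-16: every other byte is NUL, so the tag is not found.
utf16 = html.decode("ascii").encode("utf-16-le")
undetected = sniff_meta_charset(utf16)

print(declared, undetected)
```

The byte-level scan works for the first document only because the `<meta>` tag looks the same in ASCII and in the declared encoding. For the UTF-16 copy, the declaration cannot be read until the encoding is already known by some other means.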


Re^4: HTML parsing module handles known and unknown encoding
by ikegami (Pope) on Nov 16, 2011 at 22:19 UTC

    What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

    1. Parse the HTML doc using another parser that can accept an encoding.
    2. If the document does not indicate its own encoding,
      1. Add a META element if none exists.
      2. Serialise the HTML.
      3. Replace the original HTML with this new HTML.
    3. Parse the HTML doc using XML::LibXML.

      I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
