
Re^3: HTML parsing module handles known and unknown encoding

by grantm (Parson)
on Nov 16, 2011 at 21:07 UTC (#938467)

in reply to Re^2: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

I don't see any way of specifying the encoding of an HTML document

Yes, the encoding of an HTML document is normally specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to embed the equivalent of an HTTP header in the HTML itself. For example:

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

Of course this will only really work when the encoding is some superset of ASCII (ISO-8859-*, UTF-8 etc.), because the parser has to be able to read the <meta> tag before it knows what the encoding is.
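To make the idea concrete, here is a minimal sketch of how one might check whether a chunk of HTML carries such a declaration before handing it to a parser. The helper name declared_charset and the 4096-byte look-ahead are assumptions made for illustration; neither comes from XML::LibXML, and a regex is only a rough stand-in for real HTML parsing.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::LibXML): return the charset named in a
# <meta> declaration, or nothing if the document does not declare one.
# It scans raw bytes, so it only works for ASCII-compatible encodings.
sub declared_charset {
    my ($html_bytes) = @_;

    # Only look at the start of the document, where <meta> normally lives.
    # The 4096-byte limit is an arbitrary choice for this sketch.
    my $head = substr($html_bytes, 0, 4096);

    while ($head =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        # Covers both <meta http-equiv=... content="...; charset=X">
        # and the shorter <meta charset="X"> form.
        return lc $1 if $attrs =~ /\bcharset\s*=\s*["']?([\w.:-]+)/i;
    }
    return;
}
```

Called on the example tag above, it should report utf-8; on a document with no declaration it returns undef in scalar context, which is the case where the declaration cannot help.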

Re^4: HTML parsing module handles known and unknown encoding
by ikegami (Pope) on Nov 16, 2011 at 22:19 UTC

    What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some HTML documents? Are you suggesting one should convince the provider of the HTML document to edit it so XML::LibXML can process it? Are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

    1. Parse the HTML doc using another parser that can accept an encoding.
    2. If the document does not indicate its own encoding,
      1. Add a META element if none exist.
      2. Serialise the HTML.
      3. Replace the original HTML with this new HTML.
    3. Parse the HTML doc using XML::LibXML.

        I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
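For the case where the encoding is known from somewhere else (the HTTP headers, say) but the document itself declares nothing, a rough sketch of the "add a META element" step might look like the following. Note that this is a shortcut rather than the steps listed above: it edits the bytes with a regex instead of parsing and re-serialising with a second parser. The helper name ensure_charset_declaration and its $fallback parameter are hypothetical, and prepending the tag when there is no <head> assumes the downstream parser tolerates that.

```perl
use strict;
use warnings;

# Hypothetical helper: if the byte string has no charset declaration, insert
# one naming $fallback (an encoding learned elsewhere, e.g. the HTTP headers).
# Works on raw bytes, so it assumes an ASCII-compatible encoding.
sub ensure_charset_declaration {
    my ($html_bytes, $fallback) = @_;

    # Leave documents that already declare a charset untouched.
    return $html_bytes if $html_bytes =~ /<meta\b[^>]*\bcharset\s*=/i;

    my $meta = qq{<meta http-equiv="content-type" content="text/html; charset=$fallback" />};

    # Prefer to insert right after the opening <head> tag.
    return $html_bytes if $html_bytes =~ s/(<head\b[^>]*>)/$1$meta/i;

    # No <head> found: prepend and rely on a lenient parser (an assumption).
    return $meta . $html_bytes;
}
```

The returned bytes could then be passed to the HTML parsing method mentioned above in place of the original document.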
