PerlMonks  

Re^4: HTML parsing module handles known and unknown encoding

by ikegami (Pope)
on Nov 16, 2011 at 22:19 UTC (#938478)


in reply to Re^3: HTML parsing module handles known and unknown encoding
in thread HTML parsing module handles known and unknown encoding

What are you saying? Are you pointing out the irrelevant fact that XML::LibXML can process some other HTML documents? Are you suggesting one should convince the provider of the HTML documents to edit them so XML::LibXML can process them? Or are you suggesting it's acceptable to do the following to parse an HTML document using XML::LibXML?

  1. Parse the HTML doc using another parser that can accept an encoding.
  2. If the document does not indicate its own encoding,
    1. Add a META element if none exists.
    2. Serialise the HTML.
    3. Replace the original HTML with this new HTML.
  3. Parse the HTML doc using XML::LibXML.
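
To make the cost of that workaround concrete, here is a minimal sketch of step 2 only. It assumes the encoding is known out of band (say, from an HTTP header), and it swaps the "other parser" from step 1 for a crude regex check, so it is not a robust implementation. The helper name ensure_charset_declaration is hypothetical; only the core Encode module is used.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: given raw HTML bytes and an encoding known from
# elsewhere, return bytes that carry their own charset declaration.
sub ensure_charset_declaration {
    my ($html_bytes, $known_encoding) = @_;

    # Crude check (a regex, not a real parser): is a charset already declared
    # inside a META tag? If so, leave the document untouched.
    return $html_bytes if $html_bytes =~ /<meta\b[^>]*\bcharset\s*=/i;

    # Decode using the externally supplied encoding.
    my $html = decode($known_encoding, $html_bytes);
    my $meta = '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">';

    # Insert the declaration right after <head>, or prepend it if no <head> exists.
    $html =~ s/(<head\b[^>]*>)/$1$meta/i
        or $html = $meta . $html;

    # Re-serialise as UTF-8 so the bytes match the declaration just added.
    return encode('UTF-8', $html);
}

# Example input: Latin-1 bytes with no charset declaration.
my $bytes = "<html><head><title>caf\xE9</title></head><body></body></html>";
my $fixed = ensure_charset_declaration($bytes, 'iso-8859-1');
```

The resulting $fixed would then be handed to XML::LibXML's HTML parser (step 3). Even in this reduced form, the document is inspected and rewritten once before it is ever parsed for real.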


Re^5: HTML parsing module handles known and unknown encoding
by grantm (Parson) on Nov 17, 2011 at 23:42 UTC
    I was merely pointing out that if the HTML includes an encoding ("charset") declaration, then XML::LibXML's parse_html method will honour it. I guess that's not much use if the HTML doesn't include a declaration.
