PerlMonks  

Re: HTML parsing module handles known and unknown encoding

by ikegami (Pope)
on Nov 16, 2011 at 19:08 UTC (#938441)


in reply to HTML parsing module handles known and unknown encoding

XML parsers should expect encoded documents since the encoding is specified in the document itself. This is what XML::LibXML expects. (Even when using the HTML parsing methods, which is problematic.)

HTML parsers should expect decoded documents since the encoding is not specified in the document itself. You are clearly dealing with HTML since you mention "HTTP header".

And that's where you should be looking, in the tool that handles the HTTP transfer. $response->decoded_content() will decode the HTML for you, based on the HTTP header, BOM and META elements (if HTML).
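As a minimal sketch of the header-driven case, the snippet below builds a response by hand rather than fetching one, so the status line, header value and body are all made up for illustration. It only assumes that HTTP::Response (from the HTTP::Message distribution that LWP uses) is installed.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response built by hand so the example needs no network access.
# The body is the UTF-8 encoding of "café": the é is the two bytes C3 A9.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html; charset=UTF-8' ],
    "<p>caf\xC3\xA9</p>",
);

my $raw     = $response->content;          # undecoded bytes
my $decoded = $response->decoded_content;  # decoded using the header's charset

printf "raw length:     %d\n", length $raw;      # 12 (bytes)
printf "decoded length: %d\n", length $decoded;  # 11 (characters)
```

The decoded string is what should be handed to an HTML parser that expects decoded input; the raw bytes are what an encoding-aware XML parser would want.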

The default charset decoded_content uses is iso-8859-1, but you can change it to cp1252 by passing default_charset => 'cp1252' as an argument to decoded_content.

my $decoded_html = $response->decoded_content(default_charset => 'cp1252');
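To see why the fallback matters, here is a small sketch with a hand-built response (hypothetical content, no network access) whose header names no charset and whose body has no BOM or META element. Bytes 0x93 and 0x94 are curly quotes in cp1252, whereas iso-8859-1 maps them to control characters.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical response: no charset in the header, no BOM, no META element.
# Bytes 0x93 and 0x94 are the curly double quotes in cp1252.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html' ],
    "<p>\x93quoted\x94</p>",
);

my $decoded_html = $response->decoded_content(default_charset => 'cp1252');

# List the code points of every non-ASCII character in the result.
my @code_points = map  { sprintf 'U+%04X', ord($_) }
                  grep { ord($_) > 0x7F }
                  split //, $decoded_html;

print "@code_points\n";   # U+201C U+201D
```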


Re^2: HTML parsing module handles known and unknown encoding
by ambrus (Abbot) on Nov 17, 2011 at 07:05 UTC
    HTML parsers should expect decoded documents since the encoding is not specified in the document itself.

    Indeed, if you know the encoding of the document, from the HTTP header or somewhere else, then you should decode the HTML using that, so the HTML parser should accept such a decoded string. However, in the real world, you'll often find that HTML documents are served over HTTP where the HTTP headers don't tell the encoding of the document. In that case, you'll need a way to find the encoding from the document itself. A HTML parser should support both of these cases.

    Update: ikegami points out that his post already says LWP can find the encoding from the meta tag. I'll definitely look at this, for even if I don't use LWP to retrieve the document through HTTP, I can probably ask it to find the meta tag (the LWP interface is usually quite nice in such things), or at the very least I can look at the implementation. Thank you for the hint.
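For a document that was not retrieved through LWP, one possible approach is to wrap the bytes in an HTTP::Response by hand and let it do the sniffing. This is only a sketch: the document bytes and the header are invented for the example, and it assumes an HTTP::Message version that provides content_charset and looks at META elements.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical document bytes, e.g. read from a file in binary mode.
# The META element declares ISO-8859-2, where byte 0xB3 is U+0142.
my $bytes = join '',
    '<html><head>',
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-2">',
    "</head><body><p>z\xB3oty</p></body></html>";

# Wrap the bytes in a response whose header names no charset,
# so the charset has to be found in the document itself.
my $response = HTTP::Response->new(
    200, 'OK', [ 'Content-Type' => 'text/html' ], $bytes,
);

my $charset = $response->content_charset;   # charset name found by sniffing
my $decoded = $response->decoded_content;   # text decoded with that charset

printf "detected charset: %s\n", defined $charset ? $charset : '(none found)';
printf "contains U+0142:  %s\n", $decoded =~ /\x{142}/ ? 'yes' : 'no';
```

If the charset is already known from some other source, passing it explicitly with the charset option of decoded_content avoids the guessing altogether.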

      However, in the real world, you'll often find that HTML documents are served over HTTP where the HTTP headers don't tell the encoding of the document.

      You appear to have missed crucial information in my post. Like I said, $response->decoded_content() will decode the HTML for you, based on the HTTP header, BOM and META elements (if HTML).

      In that case, you'll need a way to find the encoding from the document itself. A HTML parser should support both of these cases.

      Sure, though I'm not familiar with an HTML parser that supports both.
