http://www.perlmonks.org?node_id=938394

ambrus has asked for the wisdom of the Perl Monks concerning the following question:

Which HTML parsing modules can decode HTML of any encoding properly?

Ideally, I'd like a parser that I can invoke in two ways. If I already know the encoding of the HTML for sure (e.g. from the HTTP header), I tell that encoding to the module and it decodes it (or I decode it myself and pass the decoded text, it doesn't matter). If I don't know the encoding, I pass an undecoded byte stream, and the module checks the HTML for a meta http-equiv content-type tag that tells the encoding (for which it will first have to check for byte order marks, to be able to find that tag in utf-16 (and utf-32) encoded text), and decodes the HTML using that automatically. (If the encoding is unknown and there's no byte order mark, it guesses some default, which could be cp1252 or possibly user-specified.)

It appears that HTML::Tree cannot do this. Does anyone know about the parsers of HTML::Tidy or XML::LibXML, or any other module? Obviously the parsers of most browsers would have some code like this. I could try to implement this myself and contribute to HTML::Tree, but I would like to know about any existing implementation first.

Update 2011-11-17: struck out the part about utf-32, for the HTML5 draft standard recommends against it. However, I believe utf-16 encoded HTML exists in the wild, so that part may still be important.


Replies are listed 'Best First'.
Re: HTML parsing module handles known and unknown encoding
by Corion (Patriarch) on Nov 16, 2011 at 15:49 UTC

    It seems that XML::LibXML has thought about the problem and solved it by expecting that you always pass octets to XML::LibXML. If you have an encoding handy, you're allowed to tell XML::LibXML about it, but it's not necessary.

    I'm not sure how well XML::LibXML works with UTF-16LE and/or UTF-16BE and BOMs - you might need to use some regular (byte-)expressions to handle the BOM yourself.
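
    For what it's worth, a minimal sketch of such a byte-level BOM check, assuming $octets holds the raw, undecoded input (only the three common BOMs are distinguished here):

    # note: a UTF-32LE BOM (FF FE 00 00) would be misdetected as UTF-16LE
    # by this simple check; a real implementation should test for it first
    my $bom_encoding =
          $octets =~ /\A\xFF\xFE/     ? 'UTF-16LE'
        : $octets =~ /\A\xFE\xFF/     ? 'UTF-16BE'
        : $octets =~ /\A\xEF\xBB\xBF/ ? 'UTF-8'
        :                               undef;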

      That works fine for XML since XML must specify its encoding within the document (binary format), but not so much with HTML where the encoding is specified outside of the document (text format).

      I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

      XML::LibXML handles UTF-16 just fine.

        I don't see any way of specifying the encoding of an HTML document, which is weird because XML::LibXML supposedly handles HTML.

        The docs of XML::LibXML::Parser say, under the heading PARSER OPTIONS, that there's a parser option encoding which sets the “character encoding of the input” for HTML.
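
        For example, if I read those docs right, a minimal sketch of passing that option when parsing HTML octets (the file name is just a placeholder, and load_html needs a reasonably recent XML::LibXML):

        use XML::LibXML;

        # read the raw, undecoded octets
        open my $fh, '<:raw', 'page.html' or die "page.html: $!";
        my $octets = do { local $/; <$fh> };

        # tell the HTML parser which encoding the octets are in
        my $dom = XML::LibXML->load_html(
            string   => $octets,
            encoding => 'windows-1252',   # e.g. taken from the HTTP header
            recover  => 1,                # don't die on real-world tag soup
        );
        print $dom->findvalue('//title'), "\n";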

        I don't see any way of specifying the encoding of an HTML document

        Yes, HTML encoding is specified in the HTTP headers, but you can use the 'http-equiv' attribute on a <meta> tag to include arbitrary headers in your HTML. For example:

        <meta http-equiv="content-type" content="text/html; charset=utf-8" />

        Of course this will really only work in cases where the encoding is some superset of ASCII (like iso8859-*, utf8 etc).

        I thought that you could set the encoding through XML::LibXML::Document->setEncoding, but for that to work, you need to parse the document first, which will likely ruin the encoded characters.
Re: HTML parsing module handles known and unknown encoding
by ikegami (Patriarch) on Nov 16, 2011 at 19:08 UTC

    XML parsers should expect encoded documents since the encoding is specified in the document itself. This is what XML::LibXML expects. (Even when using the HTML parsing methods, which is problematic.)

    HTML parsers should expect decoded documents since the encoding is not specified in the document itself. You are clearly dealing with HTML since you mention "HTTP header".

    And that's where you should be looking, in the tool that handles the HTTP transfer. $response->decoded_content() will decode the HTML for you, based on the HTTP header, BOM and META elements (if HTML).

    The default charset decoded_content uses is iso-8859-1, but you can change it to cp1252 by passing default_charset => 'cp1252' as an argument to decoded_content.

    my $decoded_html = $response->decoded_content(default_charset => 'cp1252');
      HTML parsers should expect decoded documents since the encoding is not specified in the document itself.

      Indeed, if you know the encoding of the document, from the HTTP header or somewhere else, then you should decode the HTML using that, so the HTML parser should accept such a decoded string. However, in the real world, you'll often find that HTML documents are served over HTTP where the HTTP headers don't tell the encoding of the document. In that case, you'll need a way to find the encoding from the document itself. An HTML parser should support both of these cases.

      Update: ikegami points out that, as he already wrote in his post, LWP can find the encoding from the meta tag. I'll definitely look at this, for even if I don't use LWP to retrieve the document through HTTP, I can probably ask it to find the meta tag (the LWP interface is usually quite nice about such things), or at the very least I can look at the implementation. Thank you for the hint.
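
      If it does turn out to be useful, one way to reuse that logic without fetching via LWP might be to wrap the already-obtained octets in an HTTP::Response and let decoded_content do the detection; a hedged sketch ($octets and the fake status line are illustrative):

      use HTTP::Response;

      # $octets holds the raw, undecoded HTML obtained by whatever means.
      # No charset is given in the header on purpose, so decoded_content
      # has to fall back on the BOM / <meta> detection ikegami describes.
      my $fake_response = HTTP::Response->new(
          200, 'OK',
          [ 'Content-Type' => 'text/html' ],
          $octets,
      );
      my $decoded_html = $fake_response->decoded_content(
          default_charset => 'cp1252',
      );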

        However, in the real world, you'll often find that HTML documents are served over HTTP where the HTTP headers don't tell the encoding of the document.

        You appear to have missed crucial information in my post. Like I said, $response->decoded_content() will decode the HTML for you, based on the HTTP header, BOM and META elements (if HTML).

        In that case, you'll need a way to find the encoding from the document itself. An HTML parser should support both of these cases.

        Sure, though I'm not familiar with an HTML parser that supports both.

Re: HTML parsing module handles known and unknown encoding
by kennethk (Abbot) on Nov 16, 2011 at 17:45 UTC
    Forgive my ignorance, but how does the following not meet your spec? Obviously I haven't included wide characters in the test, but I would like to understand what the trip-ups are for my own education.

    #!/usr/bin/perl -w
    use strict;
    require HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;    # empty tree
    $tree->parse($_) while <DATA>;
    $tree->eof;
    $tree->elementify();

    # Looking for
    # <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
    my $content_type = $tree->look_down(
        '_tag', 'meta',
        sub {
            my $elem = shift;
            # guard against metas that have no http-equiv attribute
            ($elem->attr('http-equiv') || '') eq 'Content-Type';
        }
    );
    print $content_type->attr('content'), "\n";

    __DATA__
    <html xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:w="urn:schemas-microsoft-com:office:word"
    xmlns:m="http://schemas.microsoft.com/office/2004/12/omml"
    xmlns="http://www.w3.org/TR/REC-html40">
    <head>
    <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
    <meta name=ProgId content=Word.Document>
    <meta name=Generator content="Microsoft Word 14">
    <meta name=Originator content="Microsoft Word 14">
    <link rel=File-List href="junk_files/filelist.xml">
    <link rel=themeData href="junk_files/themedata.thmx">
    <link rel=colorSchemeMapping href="junk_files/colorschememapping.xml">
    </head>
    <body lang=EN-US style='tab-interval:.5in'>
    <div class=WordSection1>
    <p class=MsoNormal>This is some text</p>
    </div>
    </body>
    </html>

      This extracts the content-type from the meta tag, which is a good start, but it's not a complete solution. Here are some additional parts a complete solution will need.

      • The biggest problem is that after HTML::TreeBuilder has parsed the whole document, it has resolved character entities (ampersand escapes), so by that time it's too late to know the encoding: you'd have to decode only the part of the text (and attributes) that wasn't generated from entities.
        The simplest solution for this would be to do two passes: first parse the beginning of the document (the HTML draft standard recommends 1024 bytes) to find the encoding from the meta tag, then decode the text of the document and reparse the whole thing. (A rough sketch of this two-pass approach follows at the end of this list.) Another solution would be changing HTML::Parser to be able to start decoding the text immediately after the meta tag containing the encoding, but this seems more complicated than reparsing the beginning of the document.
      • You would have to extract the name of the encoding "windows-1252" from the content-type string "text/html; charset=windows-1252" (this part is easy).
      • Once you know the encoding, you will have to decode the input document. You might want to do this even if the document arrives in chunks (for HTML::TreeBuilder has such an API) rather than read from a file.
      • Even before starting to find the meta element specifying the encoding, you will have to check for UTF-16 input.
      • And you will, of course, need an interface where the user can decide between the module detecting the encoding this way versus the caller specifying an encoding.
      • Oh, and as a bonus, you may want this working even if you use only HTML::Parser without HTML::TreeBuilder.
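
      For illustration, here is a rough sketch of the two-pass approach from the first point above; the helper name, the naive charset regex and the cp1252 fallback are all assumptions, not a finished implementation:

      use Encode qw(decode);
      use HTML::TreeBuilder;

      sub parse_html_octets {
          my ($octets) = @_;

          # pass 1: byte order marks first, then scan the first 1024 bytes
          # (as the HTML5 draft recommends) for a charset declaration
          my $encoding;
          if    ($octets =~ /\A\xFF\xFE/)     { $encoding = 'UTF-16LE' }
          elsif ($octets =~ /\A\xFE\xFF/)     { $encoding = 'UTF-16BE' }
          elsif ($octets =~ /\A\xEF\xBB\xBF/) { $encoding = 'UTF-8'    }
          elsif (substr($octets, 0, 1024) =~ /charset\s*=\s*["']?([\w.-]+)/i) {
              $encoding = $1;   # e.g. from <meta http-equiv=Content-Type ...>
          }
          $encoding ||= 'cp1252';   # guessed default, as in the original question

          # pass 2: decode everything and reparse the whole document
          my $tree = HTML::TreeBuilder->new;
          $tree->parse(decode($encoding, $octets));
          $tree->eof;
          return $tree;
      }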