Re^2: HTML parsing module handles known and unknown encoding

This extracts the content-type from the meta tag, which is a good start, but it's not a complete solution. Here are some additional parts a complete solution will need.

The biggest problem is that by the time HTML::TreeBuilder has parsed the whole document, it has already resolved character entities (ampersand escapes). At that point it's too late to apply the encoding: you'd have to decode only the parts of the text (and attributes) that came from raw bytes and leave alone the parts generated from entities, and the tree no longer tells you which is which.
The simplest solution for this would be to do two passes: first parse the beginning of the document (the HTML draft standard recommends 1024 bytes) to find the encoding from the meta tag, then decode the text of the document and reparse the whole thing. Another solution would be changing HTML::Parser to be able to start decoding the text immediately after the meta tag containing the encoding, but this seems more complicated than reparsing the beginning of the document.
You would have to extract the name of the encoding "windows-1252" from the content-type string "text/html; charset=windows-1252" (this part is easy).
Once you know the encoding, you will have to decode the input document. You might want this to work even when the document arrives in chunks (HTML::TreeBuilder has an API for that) rather than being read from a file.
Even before looking for the meta element specifying the encoding, you will have to check for UTF-16 input.
And you will, of course, need an interface where the user can decide between the module detecting the encoding this way versus the caller specifying an encoding.
Oh, and as a bonus, you may want this working even if you use only HTML::Parser without HTML::TreeBuilder.
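
To make the detection steps above a bit more concrete, here is a minimal sketch using only the core Encode module. The names guess_html_encoding and decode_html are made up for this post and are not part of HTML::Parser or HTML::TreeBuilder. The regex scan of the first 1024 bytes is a simplification standing in for a real first parsing pass, and only UTF-16 input that starts with a byte order mark is recognized.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: guess the encoding of raw HTML bytes.
# If the caller passes $override, it wins and no detection is done.
sub guess_html_encoding {
    my ($bytes, $override) = @_;
    return $override if defined $override;

    # Check for UTF-16 first: a byte-oriented scan for <meta> cannot work
    # on two-byte characters. Encode's 'UTF-16' decoder reads the BOM itself.
    return 'UTF-16' if $bytes =~ /\A(?:\xFF\xFE|\xFE\xFF)/;

    # Look only at the beginning of the document.
    my $head = substr($bytes, 0, 1024);
    while ($head =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        # Pulls "windows-1252" out of content="text/html; charset=windows-1252"
        # (a bare charset="..." attribute matches too).
        return $1 if $attrs =~ /\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i;
    }
    return;    # nothing found
}

# Hypothetical helper: decode the whole document so it can be (re)parsed
# as character data. Returns nothing if no usable encoding is known.
sub decode_html {
    my ($bytes, $override) = @_;
    my $name = guess_html_encoding($bytes, $override);
    return unless defined $name && Encode::find_encoding($name);
    return Encode::decode($name, $bytes);
}

my $html = qq{<meta http-equiv="Content-Type" }
         . qq{content="text/html; charset=windows-1252"><p>\x80 5</p>};
my $detected = decode_html($html);           # module detects the encoding
my $forced   = decode_html($html, 'UTF-8');  # caller specifies the encoding
```

The decoded string is what you would then hand to the parser for the second pass. What to do when nothing is detected (fall back to a default, or refuse) is a policy decision this sketch leaves to the caller.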
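
For the chunked case, one possible approach is the streaming idiom from the Encode documentation: decode with FB_QUIET and carry any incomplete trailing character over to the next chunk. This is only a sketch. make_chunk_decoder is a made-up name, and it assumes the encoding has already been determined before the first chunk is decoded.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns a closure that decodes successive byte chunks.
sub make_chunk_decoder {
    my ($encoding) = @_;
    my $pending = '';    # bytes not yet decoded (e.g. half a character)
    return sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # With FB_QUIET, decode stops at the first sequence it cannot
        # decode and leaves the unprocessed bytes in $pending.
        # Caveat: genuinely malformed input would stall here and needs
        # separate handling.
        return Encode::decode($encoding, $pending, Encode::FB_QUIET);
    };
}

# A UTF-8 "e with acute accent" (C3 A9) split across two chunks.
my $decode = make_chunk_decoder('UTF-8');
my $text   = '';
$text .= $decode->($_) for "caf\xC3", "\xA9 au lait";
```

Each decoded piece could then be passed on to the parser's own chunk interface instead of being collected in $text.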
