
Re: HTML parsing module handles known and unknown encoding

by kennethk (Abbot)
on Nov 16, 2011 at 17:45 UTC (#938427)

in reply to HTML parsing module handles known and unknown encoding

Forgive my ignorance, but how does the following not meet your spec? Obviously I haven't included long characters in the test, but I would like to understand what the trip-ups are for my own education.

#!/usr/bin/perl -w
use strict;
require HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new;    # empty tree
$tree->parse($_) while <DATA>;
$tree->eof;
$tree->elementify();

# Looking for
# <meta http-equiv=Content-Type content="text/html; charset=windows-1252">
my $content_type = $tree->look_down(
    '_tag', 'meta',
    sub {
        my $elem = shift;
        # Other meta elements have no http-equiv attribute, so guard
        # against undef to avoid "uninitialized value" warnings under -w.
        my $http_equiv = $elem->attr('http-equiv');
        defined $http_equiv and $http_equiv eq 'Content-Type';
    }
);
print $content_type->attr('content'), "\n";

__DATA__
<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns:m=""
xmlns="">
<head>
<meta http-equiv=Content-Type content="text/html; charset=windows-1252">
<meta name=ProgId content=Word.Document>
<meta name=Generator content="Microsoft Word 14">
<meta name=Originator content="Microsoft Word 14">
<link rel=File-List href="junk_files/filelist.xml">
<link rel=themeData href="junk_files/themedata.thmx">
<link rel=colorSchemeMapping href="junk_files/colorschememapping.xml">
</head>
<body lang=EN-US style='tab-interval:.5in'>
<div class=WordSection1>
<p class=MsoNormal>This is some text</p>
</div>
</body>
</html>

Re^2: HTML parsing module handles known and unknown encoding
by ambrus (Abbot) on Nov 17, 2011 at 07:00 UTC

    This extracts the content-type from the meta tag, which is a good start, but it's not a complete solution. Here are some additional parts a complete solution will need.

    • The biggest problem is that by the time HTML::TreeBuilder has parsed the whole document, it has already resolved character entities (ampersand escapes). At that point it's too late to apply the encoding: the tree mixes still-undecoded bytes with characters that came from entities, so you'd have to decode only the parts of the text (and attributes) that weren't generated from entities.
      The simplest solution is to do two passes: first parse the beginning of the document (the HTML draft standard recommends 1024 bytes) to find the encoding from the meta tag, then decode the text of the document and reparse the whole thing. Another solution would be to change HTML::Parser so that it can start decoding the text immediately after the meta tag containing the encoding, but this seems more complicated than reparsing the beginning of the document.
    • You would have to extract the name of the encoding "windows-1252" from the content-type string "text/html; charset=windows-1252" (this part is easy).
    • Once you know the encoding, you will have to decode the input document. You might want this to work even if the document arrives in chunks (HTML::TreeBuilder has an API for that) rather than being read from a file.
    • Even before starting to look for the meta element specifying the encoding, you will have to check for UTF-16 input, because UTF-16 is not ASCII-compatible and a byte-level search for the meta tag would not find it.
    • And you will, of course, need an interface where the user can decide between the module detecting the encoding this way versus the caller specifying an encoding.
    • Oh, and as a bonus, you may want this working even if you use only HTML::Parser without HTML::TreeBuilder.
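
The steps in the list above can be sketched roughly as follows, using only the core Encode module. This is a minimal illustration, not a finished module: the names sniff_encoding and decode_html are made up for this example, the regex prescan is a crude stand-in for running a real parser over the first 1024 bytes, the ISO-8859-1 fallback is an assumption, and UTF-16 input without a byte order mark is not detected.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: guess the encoding of raw (undecoded) HTML bytes.
# Returns an encoding name, or nothing if no encoding could be determined.
sub sniff_encoding {
    my ($bytes, $caller_encoding) = @_;

    # An encoding supplied by the caller overrides any detection.
    return $caller_encoding if defined $caller_encoding;

    # Check for a UTF-16 byte order mark before looking at any markup.
    # Encode's 'UTF-16' reads the BOM to pick the byte order and strips it.
    return 'UTF-16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;

    # First pass: look only at the first 1024 bytes for a charset
    # declaration inside a meta tag (crude regex, not a real parse).
    my $head = substr($bytes, 0, 1024);
    if ($head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i) {
        my $name = $1;
        # Only accept names that Encode actually recognises.
        return $name if find_encoding($name);
    }
    return;
}

# Hypothetical helper: decode the whole document to Perl characters.
# The result is what a second parsing pass would be given.
sub decode_html {
    my ($bytes, $caller_encoding) = @_;
    my $encoding = sniff_encoding($bytes, $caller_encoding);
    $encoding = 'iso-8859-1' unless defined $encoding;    # assumed fallback
    return decode($encoding, $bytes);
}

# Example input: one byte (0xE9) that needs decoding, plus one entity.
my $sample = qq{<html><head><meta http-equiv=Content-Type }
           . qq{content="text/html; charset=windows-1252"></head>}
           . qq{<body>caf\xE9 &eacute;</body></html>};

print "detected: ", sniff_encoding($sample), "\n";
my $text = decode_html($sample);
```

The decoded string in $text would then be handed to HTML::TreeBuilder (or plain HTML::Parser) for the second pass, so entities are resolved only after the raw bytes have become characters. Handling input that arrives in chunks would need extra buffering that this sketch leaves out.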
