PerlMonks  

This extracts the content-type from the meta tag, which is a good start, but it's not a complete solution. Here are some additional parts a complete solution will need.

  • The biggest problem is that by the time HTML::TreeBuilder has parsed the whole document, it has already resolved character entities (ampersand escapes). At that point it's too late to apply the encoding: you'd have to decode only those parts of the text (and attribute values) that didn't come from entities, and the tree no longer tells you which parts those are.
    The simplest solution is to do two passes: first parse only the beginning of the document (the HTML draft standard recommends 1024 bytes) to find the encoding from the meta tag, then decode the text of the whole document and parse it again from the start. Another solution would be to change HTML::Parser so that it can start decoding the text immediately after the meta tag containing the encoding, but this seems more complicated than reparsing the beginning of the document.
  • You would have to extract the name of the encoding "windows-1252" from the content-type string "text/html; charset=windows-1252" (this part is easy).
  • Once you know the encoding, you will have to decode the input document. You might want this to work even when the document arrives in chunks rather than being read from a file, since HTML::TreeBuilder has an API for chunked input.
  • Even before you start looking for the meta element that specifies the encoding, you will have to check for UTF-16 input, because UTF-16 is not ASCII-compatible and a byte-level search for the meta tag would not find anything in it.
  • And you will, of course, need an interface where the user can decide between the module detecting the encoding this way versus the caller specifying an encoding.
  • Oh, and as a bonus, you may want this working even if you use only HTML::Parser without HTML::TreeBuilder.
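
To make the list above concrete, here is a minimal sketch of the two-pass approach. It is not what any existing module does. All the sub names (encoding_from_bom, charset_from_content_type, encoding_from_meta, decode_html, tree_from_bytes) are made up for illustration, the UTF-16 check only recognises a byte order mark, and the windows-1252 fallback is an assumption rather than something the thread settled on.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Check for UTF-16 first. Only a byte order mark is recognised here;
# BOM-less UTF-16 is not detected (a limitation of this sketch).
sub encoding_from_bom {
    my ($bytes) = @_;
    return 'UTF-16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;
    return;
}

# Extract "windows-1252" from "text/html; charset=windows-1252".
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;
    return $1 if $content_type =~ /\bcharset\s*=\s*["']?([^\s;"']+)/i;
    return;
}

# First pass: parse only the first 1024 bytes and stop at the first
# <meta http-equiv="Content-Type"> that carries a charset.
sub encoding_from_meta {
    my ($bytes) = @_;
    require HTML::Parser;    # CPAN module, loaded only when needed
    my $found;
    my $p = HTML::Parser->new(
        api_version => 3,
        start_h     => [
            sub {
                my ($self, $tag, $attr) = @_;
                return unless $tag eq 'meta';
                return unless lc($attr->{'http-equiv'} // '') eq 'content-type';
                $found = charset_from_content_type($attr->{content});
                $self->eof if defined $found;    # abort the prescan
            },
            'self, tagname, attr',
        ],
    );
    # parse() returns false if a handler aborted; otherwise flush.
    $p->eof if $p->parse(substr($bytes, 0, 1024));
    return $found;
}

# The caller's choice wins, then the BOM, then the meta tag.
sub decode_html {
    my ($bytes, $caller_encoding) = @_;
    my $name = $caller_encoding
        // encoding_from_bom($bytes)
        // encoding_from_meta($bytes);
    # Fallback for a missing or unknown name; this default is an assumption.
    $name = 'windows-1252' unless defined $name && find_encoding($name);
    return decode($name, $bytes);
}

# Second pass: build the tree from the decoded text.
sub tree_from_bytes {
    my ($bytes, $caller_encoding) = @_;
    require HTML::TreeBuilder;    # CPAN module
    return HTML::TreeBuilder->new_from_content(
        decode_html($bytes, $caller_encoding));
}
```

A caller who already knows the encoding would write tree_from_bytes($bytes, 'iso-8859-2'); leaving the second argument out asks for detection. The sketch needs the whole document in memory, so it does not cover the chunked case; for that you would have to buffer the first 1024 bytes before deciding on the encoding.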

In reply to Re^2: HTML parsing module handles known and unknown encoding by ambrus
in thread HTML parsing module handles known and unknown encoding by ambrus
