You have to know the encoding, and you have to know it first, not after mish-mashing things together and guessing and hoping and blaming the spec.
In an ideal world, with HTML spec'd and written for a single-pass parser, yes. In this world, no. Any browser processing HTML, valid or tag soup, classic or XHTML, generally uses several steps to process its input. One of them is to find the encoding. An HTTP "Content-Type" header with a "charset" parameter is one way to find the encoding, meta tags are a second way, and Byte Order Marks are also used, plus a lot of heuristics.
That works quite well:
- The "charset" parameter of a "Content-Type" header is a good first guess.
- A BOM is very easy: just a very specific byte sequence for each UTF encoding at the start of the document.
- Without a BOM, UTF-16 and UTF-32 can easily be guessed from the very specific mix of 0x00 bytes and non-0x00 bytes.
- Without a BOM, UTF-8 has a lot of restrictions for bytes >= 0x80. If none of those restrictions is violated, it is very likely that the input is UTF-8.
- UTF-8 and most other encodings in use are supersets of ASCII, so bytes 0x00 to 0x7F can be treated as ASCII in a first pass. (0x80 to 0xFF are just some line noise at this point.) HTML element and attribute names are limited to ASCII, as are "charset" (encoding) names used in HTTP headers and attribute values. You do not need to know the exact encoding at this point. You just have to know how to read the ASCII characters. As written above, this means one of five ways: 8-bit ASCII superset (including UTF-8), UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.
- Now that the program has a usable guess of the encoding, it can search for meta tags, ignoring almost everything else. From the http-equiv and charset attributes, it can read the encoding actually used and start parsing the entire document.
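The first-pass guess described in the list can be sketched roughly like this. This is only an illustration, not what any particular browser does: the function name `first_guess`, the family label "ascii-superset", and the 64-byte window are assumptions made up for the example.

```python
import codecs

# BOMs, longest first, so a UTF-32LE BOM (FF FE 00 00) is not
# mistaken for a UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def first_guess(data: bytes) -> str:
    """Return a rough encoding family, good enough to read the ASCII characters."""
    # 1. A BOM is a fixed byte sequence at the very start of the document.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # 2. Without a BOM, markup starts with ASCII such as "<!DOCTYPE",
    #    so UTF-16 and UTF-32 show a fixed pattern of 0x00 bytes.
    head = data[:64]  # hypothetical window size
    if len(head) >= 4:
        if head[0] != 0 and head[1:4] == b"\x00\x00\x00":
            return "utf-32-le"
        if head[:3] == b"\x00\x00\x00" and head[3] != 0:
            return "utf-32-be"
    if len(head) >= 2:
        if head[0] != 0 and head[1] == 0:
            return "utf-16-le"
        if head[0] == 0 and head[1] != 0:
            return "utf-16-be"

    # 3. UTF-8 restricts bytes >= 0x80; if strict decoding succeeds,
    #    the input is very likely UTF-8 (pure ASCII also lands here).
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Some other 8-bit ASCII superset; the exact one is still unknown.
        return "ascii-superset"
```

The result is deliberately coarse: it only says how to read the ASCII characters, which is all the later search for meta tags needs.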
The order used may differ from browser to browser, but the encoding sniffing algorithm in the HTML standard lets a BOM win over the HTTP header, and the HTTP header win over a meta tag.
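Reading the declared encoding from the two textual sources could look something like the following sketch. The helper names `charset_from_header` and `charset_from_meta`, the `family` argument, and the default `limit` of 1024 bytes are assumptions for illustration only; real browsers use a proper tokenizer rather than a regular expression, and which source takes precedence is left to the caller here.

```python
import re

# charset=... inside a Content-Type value, optionally quoted
_HEADER_RE = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)

# <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">
_META_RE = re.compile(
    r'<meta\b[^>]*?charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)',
    re.IGNORECASE,
)


def charset_from_header(content_type):
    """Return the lowercased charset of a Content-Type header value, or None."""
    match = _HEADER_RE.search(content_type or "")
    return match.group(1).lower() if match else None


def charset_from_meta(data, family="ascii-superset", limit=1024):
    """Search the first `limit` bytes for a meta tag that names an encoding.

    `family` is only a rough guess of how to read the ASCII characters.
    For any 8-bit ASCII superset, latin-1 maps every byte to one character,
    so bytes >= 0x80 stay harmless line noise.
    """
    codec = "latin-1" if family in ("ascii-superset", "utf-8") else family
    text = data[:limit].decode(codec, errors="replace")
    match = _META_RE.search(text)
    return match.group(1).lower() if match else None
```

For example, `charset_from_header("text/html; charset=ISO-8859-1")` gives `"iso-8859-1"`, and `charset_from_meta(b'<meta charset="utf-8">')` gives `"utf-8"`.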
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Can you Data::Dump::dd() whatever document that is supposed to represent and explain the problem?
Here is what I tested with Firefox. Where the charset is located makes no difference, as long as it appears in the document:
$VAR1 = "<!DOCTYPE html>
<html>
<title> O\314\276\315\253\315\204\314\277\314\221\315\244\315\243\314\
+276\315\233\314\201\314\201\315\200\314\241\315\236\314\264\315\217\3
+14\235\314\235\314\230\314\252\314\256\314\271\314\252\315\216\315\23
+2\314\236\314\237\314\243\314\261\314\244\314\230\314\272\315\225\314
+\252H\314\205\315\247\315\233\315\247\314\276\315\222\315\253\322\211
+\322\211\315\211\314\263\314\230\314\255\314\253\315\205\314\257A\315
+\204\314\212\315\256\315\251\314\205\314\214\315\204\315\256\315\247\
+315\236\314\265\315\200\315\201\314\270\315\231\314\251\314\261\314\2
+35\315\231\314\261\314\253\314\234\315\231\314\260\314\273\314\235\31
+5\225\315\211\314\255\314\256\314\226\315\226\315\207I\314\276\315\23
+3\314\205\314\211\314\241\315\201\314\236\314\227\315\210\315\231\314
+\240\315\223\315\211\314\257\314\235\314\256\314\262\314\256\315\225\
+315\205\314\243\314\255\314\252 </title>
<p>entities<tt><title>O̾ͫ̈́̿̑ͤͣ
+;̴̡̾͛́́̀͞͏̝̝&#
+792;̪̮̹̪͎͚̞̟̣̱̤
+;̘̺͕̪H̅ͧ͛ͧ̾͒ͫ&
+#1161;҉͉̳̘̭̫̯ͅÄ́̊&
+#878;̵ͩ̅̌̈́ͮͧ̀́͞R
+4;͙̩̱̝͙̱̫̜͙̰̻&
+#797;͕͉̭̮̖͖͇I̾͛̅
+77;̡̞̗͈͙̠͓͉̯̝́
+̮̲̮͕̣̭̪ͅ
</title></tt>
<p>straightup utf8 like title <tt> O\314\276\315\253\315\204\314\277\3
+14\221\315\244\315\243\314\276\315\233\314\201\314\201\315\200\314\24
+1\315\236\314\264\315\217\314\235\314\235\314\230\314\252\314\256\314
+\271\314\252\315\216\315\232\314\236\314\237\314\243\314\261\314\244\
+314\230\314\272\315\225\314\252H\314\205\315\247\315\233\315\247\314\
+276\315\222\315\253\322\211\322\211\315\211\314\263\314\230\314\255\3
+14\253\315\205\314\257A\315\204\314\212\315\256\315\251\314\205\314\2
+14\315\204\315\256\315\247\315\236\314\265\315\200\315\201\314\270\31
+5\231\314\251\314\261\314\235\315\231\314\261\314\253\314\234\315\231
+\314\260\314\273\314\235\315\225\315\211\314\255\314\256\314\226\315\
+226\315\207I\314\276\315\233\314\205\314\211\314\241\315\201\314\236\
+314\227\315\210\315\231\314\240\315\223\315\211\314\257\314\235\314\2
+56\314\262\314\256\315\225\315\205\314\243\314\255\314\252 </tt>
<meta charset=\"utf-8\">
";
The page I linked earlier warns against using
encodings in which the bytes corresponding to "<script>" in ASCII can encode a different string, like JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, and encodings based on EBCDIC..authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings
So the order doesn't matter.
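To make the quoted warning a bit more concrete, here is a minimal check of whether "<script>" keeps its ASCII bytes in a given encoding. The helper name `keeps_ascii_bytes` is made up for this example, and it only covers the simplest case, encodings whose bytes for ASCII characters differ (UTF-16, UTF-32, EBCDIC). Stateful encodings such as those based on ISO-2022 can pass this check and still fall under the warning, because the same bytes can mean something else after an escape sequence.

```python
def keeps_ascii_bytes(encoding, probe="<script>"):
    """True if `probe` is encoded to the same bytes as in ASCII."""
    return probe.encode(encoding) == probe.encode("ascii")


# ASCII supersets: a first pass can read the markup byte by byte.
for name in ("utf-8", "latin-1", "cp1252"):
    print(name, keeps_ascii_bytes(name))

# Not ASCII supersets: cp037 is an EBCDIC code page.
for name in ("utf-16-le", "utf-32-be", "cp037"):
    print(name, keeps_ascii_bytes(name))

# The reverse direction: the ASCII bytes of "<script>" read as EBCDIC
# are some other string entirely.
print(repr(b"<script>".decode("cp037")))
```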