Re^4: Ordering meta tags with HTML::Element

Re^5: Ordering meta tags with HTML::Element by afoken (Chancellor) on Jun 15, 2016 at 06:26 UTC
"You have to know the encoding, and you have to know it first, not after mish-mashing things together and guessing and hoping and blaming the spec."

In an ideal world, with HTML spec'd and written for a single-pass parser, yes. In this world, no.

Any browser processing HTML, valid or tag soup, classic or XHTML, generally uses several steps to process its input. One of them is to find the encoding. An HTTP "`Content-Type`" header with a "`charset`" parameter is one way to find out the encoding, `meta` tags are a second way, and Byte Order Marks are also used, plus a lot of heuristics. That works quite well:

The "`charset`" parameter from a "`Content-Type`" header is a good first guess.

A BOM is very easy: just a very specific byte sequence for each UTF encoding at the start of the document.

Without a BOM, UTF-16 and UTF-32 can easily be guessed from the very specific mix of 0x00 bytes and non-0x00 bytes.

Without a BOM, UTF-8 places a lot of restrictions on bytes >= 0x80. If none of those restrictions is violated, it is very likely that the input is UTF-8.

UTF-8 and most other encodings in use are supersets of ASCII, so bytes 0x00 to 0x7F can be treated as ASCII in a first pass. (0x80 to 0xFF are just line noise at this point.)

HTML element and attribute names are limited to ASCII, as are the "charset" (encoding) names used in HTTP headers and attribute values.

You do not need to know the exact encoding at this point. You just have to know how to read the ASCII characters. As written above, this means one of five ways: 8-bit ASCII superset (including UTF-8), UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.

Now that the program has a usable guess of the encoding, it can search for `meta` tags, ignoring almost everything else. From the `http-equiv` and `charset` attributes, it can read the encoding actually used, and then start parsing the entire document. The order used may differ from browser to browser, but a readable meta tag usually wins over HTTP headers.
Alexander -- Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
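To make the two-pass idea in the post above a bit more concrete, here is a minimal Perl sketch. It is not what any particular browser does. The helper names `guess_family`, `looks_like_utf8`, `ascii_view` and `meta_charset` are made up for illustration, the NUL-pattern test only looks at the first few bytes, and the `meta` scan is a crude regex rather than a real prescan. It only aims to show that the encoding family is enough to read the ASCII parts and find a declared charset.

```perl
use strict;
use warnings;
use Encode ();

# True if the bytes contain non-ASCII and decode cleanly as strict UTF-8.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return 0 unless $bytes =~ /[\x80-\xFF]/;    # pure ASCII: nothing to decide
    my $copy = $bytes;                          # decode() may modify its input
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK()); 1 } ? 1 : 0;
}

# First pass: guess only the encoding *family*, just enough to read ASCII.
sub guess_family {
    my ($bytes) = @_;

    # Byte Order Marks. UTF-32LE must be tested before UTF-16LE,
    # because its BOM begins with the UTF-16LE BOM.
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    # No BOM: look at the mix of 0x00 and ASCII bytes at the very start.
    return 'UTF-32LE' if $bytes =~ /\A[\x01-\x7F]\x00\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\x00[\x01-\x7F]/;
    return 'UTF-16LE' if $bytes =~ /\A[\x01-\x7F]\x00/;
    return 'UTF-16BE' if $bytes =~ /\A\x00[\x01-\x7F]/;

    # Otherwise some 8-bit ASCII superset, possibly UTF-8.
    return looks_like_utf8($bytes) ? 'UTF-8' : 'ascii-superset';
}

# Reduce the input to something readable as ASCII (assumption: dropping
# NUL bytes is good enough for the ASCII-only markup we care about).
sub ascii_view {
    my ($bytes, $family) = @_;
    my $view = $bytes;
    $view =~ tr/\x00//d if $family =~ /\AUTF-(?:16|32)/;
    $view =~ tr/\x80-\xFF/?/;                   # non-ASCII is line noise here
    return $view;
}

# Second pass: find a charset declared in a meta tag, either as
# <meta charset=...> or inside an http-equiv content="...; charset=...".
sub meta_charset {
    my ($ascii) = @_;
    while ($ascii =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        return lc $1 if $attrs =~ /\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i;
    }
    return;
}

my $doc    = qq{<!DOCTYPE html> <html> <title>x</title> <meta charset="utf-8">};
my $family = guess_family($doc);
print "family: $family, meta charset: ",
    meta_charset(ascii_view($doc, $family)) // 'none', "\n";
```

A real implementation would also limit how far into the document it scans and would skip comments and scripts; this sketch ignores both.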
Re^6: Ordering meta tags with HTML::Element by Your Mother (Archbishop) on Jun 15, 2016 at 13:21 UTC
"The order used may differ from browser to browser, but a readable meta tag usually wins over HTTP headers."

I thought that's what I said. :P Also of note, if we're kitchen-sinking it: getting the charset wrong has had serious security implications; see utf-7-xss-attacks-in-modern-browsers.
Re^5: Ordering meta tags with HTML::Element by Anonymous Monk on Jun 15, 2016 at 02:27 UTC
Can you Data::Dump::dd() whatever document that is supposed to represent and explain the problem? Here is what I tested with firefox, where the charset is located makes no difference, as long as it appears in the document $VAR1 = "<!DOCTYPE html> <html> <title> O\314\276\315\253\315\204\314\277\314\221\315\244\315\243\314\ +276\315\233\314\201\314\201\315\200\314\241\315\236\314\264\315\217\3 +14\235\314\235\314\230\314\252\314\256\314\271\314\252\315\216\315\23 +2\314\236\314\237\314\243\314\261\314\244\314\230\314\272\315\225\314 +\252H\314\205\315\247\315\233\315\247\314\276\315\222\315\253\322\211 +\322\211\315\211\314\263\314\230\314\255\314\253\315\205\314\257A\315 +\204\314\212\315\256\315\251\314\205\314\214\315\204\315\256\315\247\ +315\236\314\265\315\200\315\201\314\270\315\231\314\251\314\261\314\2 +35\315\231\314\261\314\253\314\234\315\231\314\260\314\273\314\235\31 +5\225\315\211\314\255\314\256\314\226\315\226\315\207I\314\276\315\23 +3\314\205\314\211\314\241\315\201\314\236\314\227\315\210\315\231\314 +\240\315\223\315\211\314\257\314\235\314\256\314\262\314\256\315\225\ +315\205\314\243\314\255\314\252 </title> <p>entities<tt><title>O̾ͫ̈́̿̑ͤ&#867 +;̴̡̾͛́́̀͞͏̝̝&# +792;̪̮̹̪͎͚̞̟̣̱&#804 +;̘̺͕̪H̅ͧ͛ͧ̾͒ͫ& +#1161;҉͉̳̘̭̫̯ͅÄ́̊& +#878;̵ͩ̅̌̈́ͮͧ̀́͞&#82 +4;͙̩̱̝͙̱̫̜͙̰̻& +#797;͕͉̭̮̖͖͇I̾͛̅&#7 +77;̡̞̗͈͙̠͓͉̯̝́ +̮̲̮͕̣̭̪ͅ </title></tt> <p>straightup utf8 like title <tt> O\314\276\315\253\315\204\314\277\3 +14\221\315\244\315\243\314\276\315\233\314\201\314\201\315\200\314\24 +1\315\236\314\264\315\217\314\235\314\235\314\230\314\252\314\256\314 +\271\314\252\315\216\315\232\314\236\314\237\314\243\314\261\314\244\ +314\230\314\272\315\225\314\252H\314\205\315\247\315\233\315\247\314\ +276\315\222\315\253\322\211\322\211\315\211\314\263\314\230\314\255\3 +14\253\315\205\314\257A\315\204\314\212\315\256\315\251\314\205\314\2 +14\315\204\315\256\315\247\315\236\314\265\315\200\315\201\314\270\31 +5\231\314\251\314\261\314\235\315\231\314\261\314\253\314\234\315\231 
+\314\260\314\273\314\235\315\225\315\211\314\255\314\256\314\226\315\ +226\315\207I\314\276\315\233\314\205\314\211\314\241\315\201\314\236\ +314\227\315\210\315\231\314\240\315\223\315\211\314\257\314\235\314\2 +56\314\262\314\256\315\225\315\205\314\243\314\255\314\252 </tt> <meta charset=\"utf-8\"> ";

The link I posted before warns against using encodings in which the bytes corresponding to `"<script>"` in ASCII can encode a different string, like JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, and encodings based on EBCDIC. It also says authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.

So the order doesn't matter.
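As a rough illustration of that warning, one can ask Perl's Encode module whether the ASCII bytes of `<script>` still decode to the string `<script>` in a given encoding. This is only a sketch: `script_bytes_survive` is a made-up helper, it tests the decoder's initial state only (so it says nothing about escape-sequence tricks in ISO-2022-style encodings), and it also returns false when the local Encode does not know the encoding at all.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true if the ASCII bytes "<script>" decode to the
# same eight characters in $encoding. Returns 0 for a mismatch and also
# for an unknown encoding or a decode failure.
sub script_bytes_survive {
    my ($encoding) = @_;
    my $octets  = '<script>';
    my $decoded = eval { Encode::decode($encoding, $octets) };
    return 0 unless defined $decoded;
    return $decoded eq '<script>' ? 1 : 0;
}

for my $enc (qw(UTF-8 iso-8859-1 cp1252 UTF-16LE UTF-16BE cp37)) {
    printf "%-10s %s\n", $enc,
        script_bytes_survive($enc) ? 'reads as <script>' : 'does not read as <script>';
}
```

The ASCII supersets pass, while UTF-16 and the EBCDIC code page do not, which is the property the first ASCII-only pass over the document relies on.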

