
Re^4: Ordering meta tags with HTML::Element

by Your Mother (Archbishop)
on Jun 15, 2016 at 01:00 UTC [id://1165675]


in reply to Re^3: Ordering meta tags with HTML::Element
in thread Ordering meta tags with HTML::Element

Well, it's a glaring omission then. The <title>O̴̡̾ͫ̈́̿̑ͤͣ̾͛́́̀͞͏̝̝̘̪̮̹̪͎͚̞̟̣̱̤̘̺͕̪H̅ͧ͛ͧ̾͒ͫ҉҉͉̳̘̭̫̯ͅÄ̵̸͙̩̱̝͙̱̫̜͙̰̻̝͕͉̭̮̖͖͇́̊ͮͩ̅̌̈́ͮͧ̀́͞I̡̞̗͈͙̠͓͉̯̝̮̲̮͕̣̭̪̾͛̅̉́ͅ </title> may not be ASCII, so the encoding RFC:MUST be known before parsing any content, including other meta tags, which might have encoded attributes. You have to know the encoding, and you have to know it first, not after mish-mashing things together and guessing and hoping and blaming the spec. :P

And the obvious rebuttal will be Content-Type. Yes, it needs to be there and it needs to agree with the document, but once a file is downloaded or opened locally, there are no server headers. The <meta/> should be there, and it has to come first to be perfectly robust.
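To make the "has to be first" point concrete, here is a minimal Python sketch that reports whether any non-ASCII byte appears before the charset declaration. The helper name `charset_declaration_report` and the loose regular expression are assumptions made for illustration; this is a byte-level scan, not a real HTML parser.

```python
import re

# Assumption: a loose, ASCII-level pattern that matches both
# <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def charset_declaration_report(raw: bytes) -> dict:
    """Locate the first charset declaration and inspect the bytes before it."""
    m = META_CHARSET.search(raw)
    if m is None:
        return {"declared": None, "offset": None, "non_ascii_before": None}
    prefix = raw[:m.start()]
    return {
        "declared": m.group(1).decode("ascii").lower(),
        "offset": m.start(),
        # True means a parser has already met bytes it cannot interpret
        # reliably before it learns the encoding.
        "non_ascii_before": any(b >= 0x80 for b in prefix),
    }

meta_first = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>\xc3\x84</title></head></html>'
meta_last = b'<!DOCTYPE html><html><head><title>\xc3\x84</title><meta charset="utf-8"></head></html>'

print(charset_declaration_report(meta_first)["non_ascii_before"])
print(charset_declaration_report(meta_last)["non_ascii_before"])
```

A document that passes this check needs no guessing: everything before the declaration is plain ASCII.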

Re^5: Ordering meta tags with HTML::Element
by afoken (Chancellor) on Jun 15, 2016 at 06:26 UTC
    You have to know the encoding, and you have to know it first, not after mish-mashing things together and guessing and hoping and blaming the spec.

    In an ideal world, with HTML spec'd and written for a single-pass parser, yes. In this world, no. Any browser processing HTML, valid or tag soup, classic or XHTML, generally uses several steps to process its input. One of them is finding the encoding. An HTTP "Content-Type" header with a "charset" parameter is one way to find the encoding, meta tags are a second, and Byte Order Marks are also used, plus a lot of heuristics.

    That works quite well:

    • The "charset" parameter from a "Content-Type" header is a good first guess.
    • A BOM is very easy: just a very specific byte sequence for each UTF encoding at the start of the document.
    • Without a BOM, UTF-16 and UTF-32 can easily be guessed from the very specific mix of 0x00 bytes and non-0x00 bytes.
    • Without a BOM, UTF-8 has a lot of restrictions for bytes >= 0x80. If none of those restrictions is violated, it is very likely that the input is UTF-8.
    • UTF-8 and most other encodings in use are supersets of ASCII, so bytes 0x00 to 0x7F can be treated as ASCII in a first pass. (0x80 to 0xFF are just line noise at this point.) HTML element and attribute names are limited to ASCII, as are "charset" (encoding) names used in HTTP headers and attribute values. You do not need to know the exact encoding at this point. You just have to know how to read the ASCII characters. As written above, this means one of five ways: 8-bit ASCII superset (including UTF-8), UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE.
    • Now that the program has a usable guess of the encoding, it can search for meta tags, ignoring almost everything else. From the http-equiv and charset attributes, it can read the encoding actually used, and then start parsing the entire document.
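The steps in the list above can be sketched roughly as follows in Python. Everything here is an illustrative assumption rather than any browser's actual algorithm: the function name `sniff_encoding`, the loose meta regex, the `windows-1252` fallback, and in particular the precedence (a BOM first, then the zero-byte pattern, then a readable meta tag ahead of the HTTP charset). Real browsers and the HTML standard may rank these sources differently.

```python
import codecs
import re

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones,
# because the UTF-32LE BOM begins with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumption: a loose ASCII-level pattern, not a real HTML parser.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', re.IGNORECASE
)

def sniff_encoding(raw: bytes, http_charset=None):
    """Return (encoding, reason) for a byte string holding an HTML document."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, "bom"

    # No BOM: ASCII markup stored as UTF-16/32 shows a regular 0x00 pattern.
    head = raw[:4]
    if len(head) == 4:
        if head[0] != 0 and head[1:] == b"\x00\x00\x00":
            return "utf-32-le", "zero-byte pattern"
        if head[:3] == b"\x00\x00\x00" and head[3] != 0:
            return "utf-32-be", "zero-byte pattern"
        if head[0] != 0 and head[1] == 0 and head[2] != 0 and head[3] == 0:
            return "utf-16-le", "zero-byte pattern"
        if head[0] == 0 and head[1] != 0 and head[2] == 0 and head[3] != 0:
            return "utf-16-be", "zero-byte pattern"

    # ASCII superset: read only the ASCII bytes and look for a meta tag.
    m = META_CHARSET.search(raw)
    if m:
        return m.group(1).decode("ascii").lower(), "meta"
    if http_charset:
        return http_charset.lower(), "http"

    # Last heuristic: strict UTF-8 validation, else an assumed legacy default.
    try:
        raw.decode("utf-8")
        return "utf-8", "valid utf-8"
    except UnicodeDecodeError:
        return "windows-1252", "fallback"

print(sniff_encoding("<p>hi</p>".encode("utf-16-le")))
print(sniff_encoding(b'<meta charset="ISO-8859-1"><p>\xe4</p>', http_charset="utf-8"))
```

Note that the meta scan only works on the ASCII-superset branch; for the UTF-16 and UTF-32 families this sketch simply returns the family it detected.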

    The order used may differ from browser to browser, but a readable meta tag usually wins over HTTP headers.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
      The order used may differ from browser to browser, but a readable meta tag usually wins over HTTP headers.

      I thought that's what I said. :P Also of note, if we're kitchen-sinking it: there have been serious security implications of not getting the charset right; see utf-7-xss-attacks-in-modern-browsers.
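As a small illustration of why the UTF-7 case is dangerous, the following Python sketch decodes the same ASCII-only bytes twice. The payload is a made-up example, not taken from the linked article. Read as UTF-8 the bytes contain no angle brackets at all; read as UTF-7 they become a script element.

```python
# Hypothetical payload: every byte is plain ASCII.
payload = b"+ADw-script+AD4-alert(1)+ADw-/script+AD4-"

# A filter that looks for "<" in the raw bytes finds nothing.
contains_angle_bracket = b"<" in payload

# Same bytes, two interpretations.
as_utf8 = payload.decode("utf-8")
as_utf7 = payload.decode("utf-7")

print(contains_angle_bracket)
print(as_utf8)
print(as_utf7)
```

If a consumer can be coaxed into guessing UTF-7, markup that looked inert to the producer turns into active content, which is one more reason to declare the charset explicitly.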

Re^5: Ordering meta tags with HTML::Element
by Anonymous Monk on Jun 15, 2016 at 02:27 UTC

    Can you Data::Dump::dd() whatever document that is supposed to represent and explain the problem?

    Here is what I tested with Firefox. Where the charset is located makes no difference, as long as it appears in the document.

    $VAR1 = "<!DOCTYPE html> <html> <title> O\314\276\315\253\315\204\314\277\314\221\315\244\315\243\314\276\315\233\314\201\314\201\315\200\314\241\315\236\314\264\315\217\314\235\314\235\314\230\314\252\314\256\314\271\314\252\315\216\315\232\314\236\314\237\314\243\314\261\314\244\314\230\314\272\315\225\314\252H\314\205\315\247\315\233\315\247\314\276\315\222\315\253\322\211\322\211\315\211\314\263\314\230\314\255\314\253\315\205\314\257A\315\204\314\212\315\256\315\251\314\205\314\214\315\204\315\256\315\247\315\236\314\265\315\200\315\201\314\270\315\231\314\251\314\261\314\235\315\231\314\261\314\253\314\234\315\231\314\260\314\273\314\235\315\225\315\211\314\255\314\256\314\226\315\226\315\207I\314\276\315\233\314\205\314\211\314\241\315\201\314\236\314\227\315\210\315\231\314\240\315\223\315\211\314\257\314\235\314\256\314\262\314\256\315\225\315\205\314\243\314\255\314\252 </title> <p>entities<tt>&lt;title&gt;O&#830;&#875;&#836;&#831;&#785;&#868;&#867;&#830;&#859;&#769;&#769;&#832;&#801;&#862;&#820;&#847;&#797;&#797;&#792;&#810;&#814;&#825;&#810;&#846;&#858;&#798;&#799;&#803;&#817;&#804;&#792;&#826;&#853;&#810;H&#773;&#871;&#859;&#871;&#830;&#850;&#875;&#1161;&#1161;&#841;&#819;&#792;&#813;&#811;&#837;&#815;A&#836;&#778;&#878;&#873;&#773;&#780;&#836;&#878;&#871;&#862;&#821;&#832;&#833;&#824;&#857;&#809;&#817;&#797;&#857;&#817;&#811;&#796;&#857;&#816;&#827;&#797;&#853;&#841;&#813;&#814;&#790;&#854;&#839;I&#830;&#859;&#773;&#777;&#801;&#833;&#798;&#791;&#840;&#857;&#800;&#851;&#841;&#815;&#797;&#814;&#818;&#814;&#853;&#837;&#803;&#813;&#810; &lt;/title&gt;</tt> <p>straightup utf8 like title <tt> O\314\276\315\253\315\204\314\277\314\221\315\244\315\243\314\276\315\233\314\201\314\201\315\200\314\241\315\236\314\264\315\217\314\235\314\235\314\230\314\252\314\256\314\271\314\252\315\216\315\232\314\236\314\237\314\243\314\261\314\244\314\230\314\272\315\225\314\252H\314\205\315\247\315\233\315\247\314\276\315\222\315\253\322\211\322\211\315\211\314\263\314\230\314\255\314\253\315\205\314\257A\315\204\314\212\315\256\315\251\314\205\314\214\315\204\315\256\315\247\315\236\314\265\315\200\315\201\314\270\315\231\314\251\314\261\314\235\315\231\314\261\314\253\314\234\315\231\314\260\314\273\314\235\315\225\315\211\314\255\314\256\314\226\315\226\315\207I\314\276\315\233\314\205\314\211\314\241\315\201\314\236\314\227\315\210\315\231\314\240\315\223\315\211\314\257\314\235\314\256\314\262\314\256\315\225\315\205\314\243\314\255\314\252 </tt> <meta charset=\"utf-8\"> ";
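For readers who want to relate the octal escapes in the dump to the numeric entities, here is a short Python sketch using only the first two combining marks of the title (the fragment `O\314\276\315\253` and the entities `&#830;&#875;` from the dump). It shows that both spellings name the same code points once the bytes are read as UTF-8, and that reading the same bytes as Latin-1 gives something else entirely.

```python
import html

# First three characters of the title, as raw bytes (octal escapes as in the dump).
title_bytes = b"O\314\276\315\253"

# Decoded with the declared encoding.
title_from_bytes = title_bytes.decode("utf-8")

# The same characters written as numeric character references.
title_from_entities = html.unescape("O&#830;&#875;")

# The same bytes decoded with a wrong guess.
title_misread = title_bytes.decode("latin-1")

print([ord(c) for c in title_from_bytes])
print(title_from_bytes == title_from_entities)
print(title_from_bytes == title_misread)
```

The entities are safe under any ASCII-compatible encoding, while the raw bytes only mean what the author intended if the reader picks UTF-8.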

    The link I posted earlier warns against using encodings in which the bytes corresponding to "<script>" in ASCII can encode a different string, such as JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB (Windows code page 1361), encodings based on ISO-2022, and encodings based on EBCDIC. It also says that authors must not use the CESU-8, UTF-7, BOCU-1 and SCSU encodings.

    So the order doesn't matter.
