in reply to Re: perltidy and UTF-8 BOM
in thread perltidy and UTF-8 BOM
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8
what's the reason [...] using a BOM for [...] UTF-8 [...]?
UTF-8 encoded text should - in theory - not need a BOM, that's correct. But there are only very few cases (see below) in which a BOM causes trouble. So many editors (and other text-processing tools) automatically switch to UTF-8 encoding when they find a BOM encoded as UTF-8 (0xEF, 0xBB, 0xBF) at file offset 0. This is often completely analogous to finding a BOM encoded in UTF-16 BE, UTF-16 LE, UTF-32 BE, UTF-32 LE. Without a BOM, they usually guess. UTF-16 and UTF-32 can often be guessed by the amount and position of 0x00 bytes. UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.
So, prefixing UTF-8 encoded text with a BOM makes life easier for most tools, that's all.
The Unix #! mechanism is broken by a leading BOM, simply because the kernel expects the first two bytes of the file to be 0x23, 0x21. The BOM takes up two to four bytes and is often invisible in editors. The kernel sees an invalid magic number and so does not consider the file as a script, while the user believes that the file starts with #!. (Adding support for scripts with a BOM should be quite easy, by simply treating 0xEF 0xBB 0xBF 0x23 0x21 at the start of a file like 0x23 0x21 at the start of a file.)