
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8.

what's the reason [...] using a BOM for [...] UTF-8 [...]?

UTF-8 encoded text should, in theory, not need a BOM; that's correct. But there are only very few cases (see below) in which a BOM causes trouble. So, many editors (and other text-processing tools) automatically switch to UTF-8 encoding when they find a BOM encoded as UTF-8 (0xEF, 0xBB, 0xBF) at file offset 0. This is completely analogous to finding a BOM encoded in UTF-16 BE, UTF-16 LE, UTF-32 BE, or UTF-32 LE. Without a BOM, they usually guess. UTF-16 and UTF-32 can often be guessed from the number and position of 0x00 bytes. UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

So, prefixing UTF-8 encoded text with a BOM makes life easier for most tools, that's all.
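
The BOM check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular editor does it; the name `sniff_bom` is made up for this example, and the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE), so the order matters.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(head: bytes):
    """Return (encoding, bom_length) if head starts with a known BOM, else None.

    Hypothetical helper: only the first four bytes of the file are needed.
    """
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None  # no BOM: the caller has to guess the encoding
```

A tool would call `sniff_bom` on the first four bytes of a file and fall back to guessing only when it returns `None`.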

The Unix #! mechanism is broken by a leading BOM, simply because the kernel expects the first two bytes of the file to be 0x23, 0x21. The BOM takes up two to four bytes, depending on the encoding (three in UTF-8), and is often invisible in editors. The kernel sees an invalid magic number and so does not treat the file as a script, while the user believes that the file starts with #!. (Adding support for scripts with a BOM should be quite easy: simply treat 0xEF 0xBB 0xBF 0x23 0x21 at the start of a file like 0x23 0x21 at the start of a file.)

[...] warns if input starts with a BOM, claiming that old editors and old browsers have problems with the BOM.
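
To make the #! problem concrete, here is a small Python model of the two-byte magic-number check and of the relaxed variant suggested in the parentheses. It only mimics the comparison; it is not kernel code, and both function names are invented for this sketch.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strict_is_script(head: bytes) -> bool:
    """Model of the strict check: the file must start with 0x23 0x21 ('#!')."""
    return head[:2] == b"\x23\x21"


def lenient_is_script(head: bytes) -> bool:
    """Hypothetical relaxed check: also accept EF BB BF 23 21."""
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]  # skip the three BOM bytes
    return head[:2] == b"\x23\x21"


plain = b"#!/usr/bin/perl\n"
with_bom = UTF8_BOM + plain

print(strict_is_script(plain))      # the usual case
print(strict_is_script(with_bom))   # the BOM hides the magic number
print(lenient_is_script(with_bom))  # the relaxed check still finds '#!'
```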


Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^3: perltidy and UTF-8 BOM
by ikegami (Pope) on May 22, 2018 at 20:15 UTC

    UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

    Not really. The problem is the amount of lookahead needed. With a BOM, one can be sure after reading just a few bytes.
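
The lookahead point can be illustrated with a rough Python sketch. It assumes that "guessing" means validating the bytes as UTF-8, which is one common approach and not necessarily what any given tool does; the function names are invented for the example.

```python
import codecs


def is_utf8_by_bom(head: bytes) -> bool:
    """With a BOM, three bytes are enough to decide."""
    return head.startswith(codecs.BOM_UTF8)


def looks_like_utf8(data: bytes) -> bool:
    """Without a BOM, every byte has to be validated before answering."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 1000 ASCII bytes say nothing either way; only the final byte(s) decide.
ascii_run = b"a" * 1000
latin1_text = ascii_run + "\u00e9".encode("latin-1")  # ends in a lone 0xE9
utf8_text = ascii_run + "\u00e9".encode("utf-8")      # ends in 0xC3 0xA9

print(looks_like_utf8(latin1_text))
print(looks_like_utf8(utf8_text))
```

A pure ASCII file passes the validation as well, so a positive answer without a BOM still does not rule out a legacy encoding whose non-ASCII characters simply never occurred.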

Re^3: perltidy and UTF-8 BOM
by 1nickt (Abbot) on May 23, 2018 at 12:18 UTC

    Also, while JSON texts are required not to have a leading BOM, consumers *may* ignore one rather than treating it as an error, according to the spec (RFC 8259, section 8.1). However, of Perl's JSON libraries, only Cpanel::JSON::XS handles the case without exception.
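
For comparison only, here is a minimal test of the same case outside Perl, using Python's standard `json` module. It says nothing about how the Perl modules behave; `parse_json_bytes` is a hypothetical helper. The `utf-8-sig` codec decodes UTF-8 and drops a single leading BOM if one is present.

```python
import json


def parse_json_bytes(raw: bytes):
    """Tolerant consumer: strip an optional UTF-8 BOM, then parse."""
    return json.loads(raw.decode("utf-8-sig"))


with_bom = b"\xef\xbb\xbf" + b'{"a": 1}'

print(parse_json_bytes(with_bom))

# Decoding with plain "utf-8" keeps the BOM as U+FEFF in the string,
# and json.loads() rejects a str that starts with it.
try:
    json.loads(with_bom.decode("utf-8"))
except ValueError as err:
    print("rejected:", err)
```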

    The way forward always starts with a minimal test.