
Re: perltidy and UTF-8 BOM

by AnomalousMonk (Archbishop)
on May 21, 2018 at 04:42 UTC


in reply to perltidy and UTF-8 BOM

Total UTF-8 n00b here, but I thought the "byte order" idea of this encoding was "one byte after another from the beginning of the text stream to the end". (Of course, each character in this encoding can be one to four bytes, but the order of the bytes in a character is invariant.) Indeed, this source sez WRT UTF-8 byte order:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8 ... [emphasis added]
Out of curiosity, what's the reason you're using a BOM for your UTF-8 source code?


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^2: perltidy and UTF-8 BOM
by afoken (Chancellor) on May 22, 2018 at 20:00 UTC

    The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8

    what's the reason [...] using a BOM for [...] UTF-8 [...]?

    UTF-8 encoded text should - in theory - not need a BOM, that's correct. But there are only very few cases (see below) in which a BOM causes trouble. And many editors (and other text-processing tools) automatically switch to UTF-8 encoding when they find a BOM encoded as UTF-8 (0xEF, 0xBB, 0xBF) at file offset 0. This is completely analogous to finding a BOM encoded in UTF-16 BE, UTF-16 LE, UTF-32 BE, or UTF-32 LE. Without a BOM, they usually guess. UTF-16 and UTF-32 can often be guessed from the number and position of 0x00 bytes. UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

    So, prefixing UTF-8 encoded text with a BOM makes life easier for most tools, that's all.
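As a rough illustration of the BOM check described above, here is a minimal Python sketch (not taken from any particular editor). The function name `sniff_bom` is made up for this example; the byte signatures come from Python's standard `codecs` module. It assumes the first few bytes of the file are already in memory.

```python
import codecs

# Longest signatures first: the UTF-32 LE BOM (FF FE 00 00) begins with
# the UTF-16 LE BOM (FF FE), so testing UTF-16 first would misfire.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM sits at offset 0."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Four bytes of input are enough for a definite answer whenever a BOM is present. One caveat of this simple ordering: UTF-16 LE text whose first character after the BOM is U+0000 would be reported as UTF-32 LE.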

    The Unix #! mechanism is broken by a leading BOM, simply because the kernel expects the first two bytes of the file to be 0x23, 0x21. The UTF-8 BOM takes up three bytes (BOMs in general take two to four, depending on the encoding) and is often invisible in editors. The kernel sees an invalid magic number and so does not treat the file as a script, while the user believes that the file starts with #!. (Adding support for scripts with a BOM should be quite easy, by simply treating 0xEF 0xBB 0xBF 0x23 0x21 at the start of a file like 0x23 0x21 at the start of a file.)
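To make the #! problem concrete, here is a small Python sketch that inspects the first bytes of a script the way the paragraph above describes. The names `shebang_status` and `strip_leading_bom` are hypothetical helpers for this illustration, not part of any existing tool.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def shebang_status(head: bytes) -> str:
    """Classify the first bytes of a file with respect to the #! magic number."""
    if head.startswith(b"#!"):
        return "ok"             # bytes 0x23 0x21 at offset 0
    if head.startswith(UTF8_BOM + b"#!"):
        return "hidden-by-bom"  # looks like #! in an editor, but not to the kernel
    return "no-shebang"


def strip_leading_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    return data[len(UTF8_BOM):] if data.startswith(UTF8_BOM) else data
```

Reading the first five bytes of a file is enough for `shebang_status`; a script reported as "hidden-by-bom" becomes directly executable again once `strip_leading_bom` has been applied to its contents.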

    https://validator.w3.org/ warns if input starts with a BOM, claiming that old editors and old browsers have problems with the BOM.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

      UTF-8 can also be guessed, but it is harder and can be mixed up with some legacy encoding.

      Not really. The problem is the amount of lookahead needed. With a BOM, one can be sure after reading just a few bytes.
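A minimal Python sketch of the lookahead point, assuming the whole buffer is available: without a BOM, the only way to gain confidence is to validate the bytes as UTF-8, and an ASCII-only prefix tells you nothing. The name `looks_like_utf8` is invented for this example.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Guess UTF-8 by strict validation of the complete buffer.

    Unlike a BOM check, this has to scan until it meets a non-ASCII
    byte (or the end of the data) before it can say anything useful.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

Pure ASCII input passes the check but is equally valid in many legacy encodings, so a detector has to keep reading until the first non-ASCII byte turns up. A buffer cut off in the middle of a multi-byte character would also fail, which is why the sketch assumes complete input.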

      Also, while JSON strings are required to not have a leading BOM, consumers *should* be able to handle it, according to the spec. However, of Perl's JSON libraries, only Cpanel::JSON::XS handles the case without exception.
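For comparison outside Perl, here is a hedged Python sketch of a consumer that tolerates a leading BOM. It illustrates the idea only and says nothing about how the Perl modules mentioned above behave. The helper name `load_json_tolerant` is made up; the `utf-8-sig` codec and `json.loads` are from the Python standard library.

```python
import json


def load_json_tolerant(raw: bytes):
    """Parse UTF-8 JSON bytes, accepting an optional leading BOM."""
    # "utf-8-sig" decodes UTF-8 and drops one leading BOM if present.
    text = raw.decode("utf-8-sig")
    return json.loads(text)
```

The design choice is to deal with the BOM at the decoding layer, so the parser itself never sees U+FEFF. Python's `json.loads` raises an error when handed an already-decoded string that still begins with U+FEFF.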


      The way forward always starts with a minimal test.
Re^2: perltidy and UTF-8 BOM
by morelenmir (Beadle) on May 21, 2018 at 13:02 UTC

    A good question!!! I have found in the past that when transferring the same text file--encoded with UTF-8--between different editors, unless it had a BOM, a corrupted display of the non-ASCII characters would occur. I encountered this especially between either 'EditPad Pro' or the newer versions of 'Notepad' and 'Programmer's File Editor'. The latter is now a very old but in its day extremely handy text editor which I used for the majority of the 2000s. I also had issues with non-BOM Unicode text and the free version of 'Take Command Console', which I use exclusively instead of the native console in Windows. This is a generally excellent 'DOS' replacement but it does not support UTF-8--so again without a BOM I found weird things happened, and the last time I spoke to the chap who writes TCC he was pretty militant about only offering UTF-16 output from his console commands. So just as a blanket fix I applied a BOM to all Unicode files, whether UTF-8 or UTF-16, and never considered it again. These days I use EPP for all my editing so probably could live without it, but there would be a lot of files to re-edit and remove the BOM from! Even then I'd still run into issues with TCC, however, as I also launch the Perl runtime and debugger through it. At the end of the day, as you say, UTF-8 shouldn't need a BOM, but I have found--other than perltidy!!!--that employing one helps more than it hinders.

    I will try that idea of stripping and then reapplying the BOM.
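The strip-and-reapply idea could look roughly like this in Python. This is a sketch under assumptions: `with_bom_preserved` is a made-up helper, and `transform` is a placeholder for whatever BOM-intolerant step sits in the middle (for example, piping the bytes through a formatter), not an actual perltidy API.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def with_bom_preserved(data: bytes, transform) -> bytes:
    """Run transform on the BOM-less body, then restore the BOM if there was one."""
    had_bom = data.startswith(UTF8_BOM)
    body = data[len(UTF8_BOM):] if had_bom else data
    result = transform(body)  # hypothetical stand-in for the real tool
    return UTF8_BOM + result if had_bom else result
```

Files that never had a BOM pass through without gaining one, so the wrapper can be applied to a mixed set of files.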

    As an aside: I am afraid I do not know how to quote and reply to individual posts in this forum system so I am having to do so en masse. My apologies if this appears somewhat confusing because of it.

    "Aure Entuluva!" - Hurin Thalion at the Nirnaeth Arnoediad.
      I am afraid I do not know how to quote and reply to individual posts in this forum system so I am having to do so en masse.

      In the thread view, every post has a [reply] link in the bottom right corner of the post.

      In the individual view, there's the Comment on link underneath the post. (Update: Tux pointed out that this link could also be above the post, thanks!)

        vvv here vvv or here --->
      ... how to quote ...

      I'm not sure this is what you're referring to, but there is a  <blockquote> ... quoted text ... </blockquote> tag that this site supports; see Markup in the Monastery and Writeup Formatting Tips. (Many monks, including myself, embed italics tags within the blockquote tags to further distinguish the quote. The end result is then <blockquote><i> ... </i></blockquote>.)


      Give a man a fish:  <%-{-{-{-<

Re^2: perltidy and UTF-8 BOM
by ikegami (Patriarch) on May 22, 2018 at 19:30 UTC

    A BOM is frequently used to identify UTF-8 files even if the concept of byte order doesn't exist in UTF-8. Remember, the BOM is really just U+FEFF ZERO WIDTH NO-BREAK SPACE, a completely invisible character.
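A short Python sketch of that point: encoding the single character U+FEFF with each encoding form yields exactly the byte signatures discussed in this thread. The helper name `bom_bytes` is invented for the example; the codec names are the standard Python ones.

```python
import unicodedata

BOM_CHAR = "\ufeff"


def bom_bytes(encoding: str) -> bytes:
    """Encode U+FEFF with an explicit-endianness codec (no extra BOM is added)."""
    return BOM_CHAR.encode(encoding)


if __name__ == "__main__":
    print(unicodedata.name(BOM_CHAR))
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
        print(enc, bom_bytes(enc).hex())
```

The UTF-8 form is the three-byte sequence EF BB BF; the UTF-16 forms are two bytes and the UTF-32 forms four.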
