
Re^5: UTF-8 text files with Byte Order Mark

by ikegami (Pope)
on Oct 01, 2011 at 21:53 UTC (#929073)


in reply to Re^4: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

By the way I found that \x{FEFF} was not the same as \x{ef}\x{bb}\x{bf}

Yeah, "\x{ef}\x{bb}\x{bf}" is the UTF-8 encoding of the BOM / U+FEFF / "\x{FEFF}".


Re^6: UTF-8 text files with Byte Order Mark
by Anonymous Monk on May 23, 2012 at 13:19 UTC

    Please stop confusing people. FEFF has nothing to do with UTF-8. This is a BOM for UTF-16 Big Endian-encoded files.

      This is a BOM for UTF-16 Big Endian-encoded files.

      You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

      $ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
      0000000 ef bb bf
      0000003
      $ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
      0000000 fe ff
      0000002
      $ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
      0000000 ff fe
      0000002

      FEFF              BOM
      2B,2F,76,38,2D    BOM encoded using UTF-7
      EF,BB,BF          BOM encoded using UTF-8
      FE,FF             BOM encoded using UTF-16be
      FF,FE             BOM encoded using UTF-16le
      00,00,FE,FF       BOM encoded using UTF-32be
      FF,FE,00,00       BOM encoded using UTF-32le

      So you won't find the bytes FE,FF in a UTF-8 file, but you can find an encoded FEFF there (as EF,BB,BF), just as you can find an encoded FEFF in a UTF-16be file.
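
      A minimal sketch of what this means when removing the BOM, assuming the data is UTF-8 and starts with a BOM. Which pattern matches depends on whether the string holds raw bytes or decoded characters. The helper names strip_bom_bytes and strip_bom_chars are hypothetical, made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: strip a leading BOM from RAW UTF-8 bytes,
# e.g. data read from a file opened without an encoding layer.
sub strip_bom_bytes {
    my ($bytes) = @_;
    $bytes =~ s/^\xEF\xBB\xBF//;    # three separate bytes, not \x{EFBBBF}
    return $bytes;
}

# Hypothetical helper: strip a leading BOM from DECODED text,
# e.g. data read through a ':encoding(UTF-8)' layer.
sub strip_bom_chars {
    my ($text) = @_;
    $text =~ s/^\x{FEFF}//;         # one character, U+FEFF
    return $text;
}

# Assumed sample data: "hello" preceded by a BOM, encoded as UTF-8.
my $raw     = encode('UTF-8', "\x{FEFF}hello");
my $decoded = decode('UTF-8', $raw);

printf "raw length:     %d\n", length $raw;        # 8 (3 BOM bytes + 5)
printf "decoded length: %d\n", length $decoded;    # 6 (1 BOM char + 5)

print strip_bom_bytes($raw),     "\n";    # hello
print strip_bom_chars($decoded), "\n";    # hello
```

      If a pattern fails to match, one thing worth checking is which of the two forms the string is actually in: s/^\x{FEFF}// cannot match undecoded bytes, and s/^\xEF\xBB\xBF// cannot match decoded text.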

        I'm trying my best to understand this thread, but I'm having difficulty.
        I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

        s/chr(0xEFBBBF)//; #remove Byte Order Mark
        s/\x{EFBBBF}//;
        s/^chr(0xFEFF)//;
        s/^\x{FEFF}//;

        Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
        But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to load in order to make Perl recognize the \x{NNNN} format. Any suggestions?
        Thanks,
