PerlMonks  

Re^5: UTF-8 text files with Byte Order Mark

by ikegami (Pope)
on Oct 01, 2011 at 21:53 UTC (#929073)


in reply to Re^4: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

By the way I found that \x{FEFF} was not the same as \x{ef}\x{bb}\x{bf}

Yeah, "\x{ef}\x{bb}\x{bf}" is the UTF-8 encoding of the BOM / U+FEFF / "\x{FEFF}".


Re^6: UTF-8 text files with Byte Order Mark
by Anonymous Monk on May 23, 2012 at 13:19 UTC

    Please stop confusing people. FEFF has nothing to do with UTF-8. This is a BOM for UTF-16 Big Endian-encoded files.

      This is a BOM for UTF-16 Big Endian-encoded files.

      You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

      $ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
      0000000 ef bb bf
      0000003

      $ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
      0000000 fe ff
      0000002

      $ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
      0000000 ff fe
      0000002

      FEFF             BOM
      2B,2F,76,38,2D   BOM encoded using UTF-7
      EF,BB,BF         BOM encoded using UTF-8
      FE,FF            BOM encoded using UTF-16be
      FF,FE            BOM encoded using UTF-16le
      00,00,FE,FF      BOM encoded using UTF-32be
      FF,FE,00,00      BOM encoded using UTF-32le

      So you won't find FE,FF in a UTF-8 file, but just like in a UTF-16be file, you can find an encoded FEFF in a UTF-8 file.
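The distinction between the BOM as a character and the BOM as encoded bytes can be sketched in Perl. This is a minimal illustration, not code from the thread: the variable names are made up, and the "file contents" are simulated with an in-memory string rather than read from disk.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The BOM as a single decoded character (a code point, not bytes).
my $bom_char = "\x{FEFF}";

# Its UTF-8 encoding: the three bytes EF BB BF.
my $bom_utf8 = encode("UTF-8", $bom_char);

# Simulated raw contents of a UTF-8 file that was saved with a BOM.
my $raw = $bom_utf8 . "hello";

# In the undecoded bytes, only the encoded form is present.
my $has_encoded_bom = ($raw =~ /^\xEF\xBB\xBF/) ? 1 : 0;
my $has_fe_ff_bytes = ($raw =~ /\xFE\xFF/)      ? 1 : 0;

# After decoding, those three bytes become the one character U+FEFF.
my $text         = decode("UTF-8", $raw);
my $has_bom_char = ($text =~ /^\x{FEFF}/) ? 1 : 0;

print "UTF-8 bytes of the BOM: ", join(" ", unpack("(H2)*", $bom_utf8)), "\n";
print "raw starts with EF BB BF: $has_encoded_bom\n";
print "raw contains FE FF:       $has_fe_ff_bytes\n";
print "decoded starts with FEFF: $has_bom_char\n";
```

Which pattern matches depends on whether the string holds undecoded bytes or decoded characters, which is why the two spellings are not interchangeable.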

        I'm trying my best to understand this thread, but I'm having difficulty.
        I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

        s/chr(0xEFBBBF)//; #remove Byte Order Mark
        s/\x{EFBBBF}//;
        s/^chr(0xFEFF)//;
        s/^\x{FEFF}//;

        Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
        But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?
        Thanks,
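One possible reading of why the four substitutions above fail, offered as a sketch rather than a reply from the thread: inside a regex, `chr(...)` is not called as a function but matched as literal text; `\x{EFBBBF}` names a single code point 0xEFBBBF, not the three bytes EF BB BF; and `\x{FEFF}` only matches once the input has been decoded from UTF-8. The helper names and the sample line below are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: strip a leading BOM from raw, undecoded UTF-8 bytes.
sub strip_bom_bytes {
    my ($bytes) = @_;
    $bytes =~ s/^\xEF\xBB\xBF//;   # three separate byte escapes
    return $bytes;
}

# Hypothetical helper: strip a leading BOM from already-decoded text.
sub strip_bom_chars {
    my ($text) = @_;
    $text =~ s/^\x{FEFF}//;        # one character, U+FEFF
    return $text;
}

# Stand-in for the first line of a Notepad-saved UTF-8 file,
# as it looks when read without any decoding layer.
my $raw_line = "\xEF\xBB\xBFfirst line\n";

my $from_bytes = strip_bom_bytes($raw_line);
my $from_chars = strip_bom_chars(decode("UTF-8", $raw_line));

# The character-level pattern does not match undecoded bytes.
my $wrong_layer = strip_bom_chars($raw_line);

print "bytes version ok: ", ($from_bytes eq "first line\n" ? 1 : 0), "\n";
print "chars version ok: ", ($from_chars eq "first line\n" ? 1 : 0), "\n";
print "char regex on raw bytes left it unchanged: ",
      ($wrong_layer eq $raw_line ? 1 : 0), "\n";
```

When the file is opened with a decoding layer such as `open my $fh, '<:encoding(UTF-8)', $path`, the lines arrive as decoded text and the `\x{FEFF}` form is the one that applies; without such a layer, the byte form is.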
