PerlMonks  

Re^6: UTF-8 text files with Byte Order Mark

by Anonymous Monk
on May 23, 2012 at 13:19 UTC (#972035)


in reply to Re^5: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

Please stop confusing things. FEFF has nothing to do with UTF-8. This is a BOM for UTF-16 Big Endian-encoded files.


Re^7: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on May 23, 2012 at 17:39 UTC

    This is a BOM for UTF-16 Big Endian-encoded files.

    You are mistaken. It's the BOM, period. It can be encoded using UTF-8 and UTF-16le just as easily as with UTF-16be.

    $ perl -MEncode -e'print encode("UTF-8", chr(0xFEFF))' | od -t x1
    0000000 ef bb bf
    0000003
    $ perl -MEncode -e'print encode("UTF-16be", chr(0xFEFF))' | od -t x1
    0000000 fe ff
    0000002
    $ perl -MEncode -e'print encode("UTF-16le", chr(0xFEFF))' | od -t x1
    0000000 ff fe
    0000002

    FEFF              BOM
    2B,2F,76,38,2D    BOM encoded using UTF-7
    EF,BB,BF          BOM encoded using UTF-8
    FE,FF             BOM encoded using UTF-16be
    FF,FE             BOM encoded using UTF-16le
    00,00,FE,FF       BOM encoded using UTF-32be
    FF,FE,00,00       BOM encoded using UTF-32le

    So you won't find the bytes FE,FF in a UTF-8 file, but, just as in a UTF-16be file, you can find an encoded FEFF in a UTF-8 file (there it appears as EF,BB,BF).
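The distinction between the code point FEFF and its encoded bytes can be checked with a short script. This is only a sketch using the core Encode module; the helper name `hex_bytes` and the sample string are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# U+FEFF is a single code point; its byte form depends on the encoding.
my $bom = chr(0xFEFF);

# Hypothetical helper: render a byte string as "ef bb bf".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

# Encode the same code point with several encodings.
my %encoded = map { $_ => hex_bytes(encode($_, $bom)) }
    qw(UTF-8 UTF-16BE UTF-16LE UTF-32BE UTF-32LE);

# Sample raw bytes (an assumption): an encoded BOM followed by "hi".
my $raw  = "\xEF\xBB\xBFhi";
my $text = decode('UTF-8', $raw);

# After decoding, the first character is one code point, not three bytes.
my $first_char = sprintf 'U+%04X', ord substr($text, 0, 1);

print "$_: $encoded{$_}\n" for sort keys %encoded;
print "first decoded character: $first_char\n";
```

The byte sequences printed should match the table above, and the decoded string starts with the single character FEFF regardless of which encoding the file used.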

      I'm trying my best to understand this thread, but I'm having difficulty.
      I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

      s/chr(0xEFBBBF)//; #remove Byte Order Mark
      s/\x{EFBBBF}//;
      s/^chr(0xFEFF)//;
      s/^\x{FEFF}//;

      Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
      But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?
      Thanks,

        I'm trying my best to understand this thread, but I'm having difficulty.

        Please stop trying, there is nothing for you here. Read Tutorials/perlunitut: Unicode in Perl and perlunitut, and use :via(File::BOM)

        I've tried deleting it using all these commands, none of which works:

        Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you

        If you've got raw data you want to share you can use

        perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

        The different ways BOM can look

        $ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
        {
          # tied Readonly::Hash
          "iso-10646-1" => "\xFE\xFF",
          "UCS-2"       => "\xFE\xFF",
          "UTF-16BE"    => "\xFE\xFF",
          "UTF-16LE"    => "\xFF\xFE",
          "UTF-32BE"    => "\0\0\xFE\xFF",
          "UTF-32LE"    => "\xFF\xFE\0\0",
          "UTF-8"       => "\xEF\xBB\xBF",
          "utf8"        => "\xEF\xBB\xBF",
        }

        Of the four substitutions you tried, the last one (s/^\x{FEFF}//) is the correct one. It will remove the BOM after the text has been decoded.
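A minimal sketch of the decode-then-strip approach with s/^\x{FEFF}//, using only core modules rather than File::BOM. The temporary file and its contents are assumptions made so the example runs on its own. As for the other attempts: chr(...) is not evaluated inside a regex, and \x{EFBBBF} names one (nonexistent) code point rather than three bytes, so neither can match.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a sample file (an assumption for this sketch): the UTF-8
# encoded BOM bytes followed by two lines of text.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;    # write raw bytes, no layer translation
print {$out} "\xEF\xBB\xBF", "first line\nsecond line\n";
close $out or die "close failed: $!";

# Open with a decoding layer so Perl sees characters, not bytes.
open my $in, '<:encoding(UTF-8)', $path
    or die "Cannot open $path: $!";

my @lines;
while (my $line = <$in>) {
    # Only the first line can carry the BOM; once decoded it is the
    # single character U+FEFF.
    $line =~ s/^\x{FEFF}// if $. == 1;
    chomp $line;
    push @lines, $line;
}
close $in;

print "$_\n" for @lines;
```

If the file is read as raw bytes instead (no decoding layer), the pattern would have to match the encoded form, s/^\xEF\xBB\xBF//, which is why the substitution has to agree with whether the data has been decoded yet.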
