http://www.perlmonks.org?node_id=1035409


in reply to Re^7: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

I'm trying my best to understand this thread, but I'm having difficulty.
I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

s/chr(0xEFBBBF)//; #remove Byte Order Mark
s/\x{EFBBBF}//;
s/^chr(0xFEFF)//;
s/^\x{FEFF}//;

Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
But I haven't been able to match the BOM using either Strawberry Perl or ActiveState Perl. So I'm wondering if there's some sort of package I need to load in order to make Perl recognize the \x{NNNN} format. Any suggestions?
Thanks,

Re^9: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on May 29, 2013 at 20:18 UTC
    The last one (s/^\x{FEFF}//) is the correct one. It will remove the BOM, but only after the input has been decoded from UTF-8 into characters.
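A minimal sketch of what "after it's been decoded" means in practice, using only core Perl. The helper name read_utf8_without_bom and the temporary file are made up for illustration. The idea is that the :encoding(UTF-8) layer turns the three bytes EF BB BF into the single character U+FEFF, which is what \x{FEFF} matches. On undecoded bytes the pattern would have to be s/^\xEF\xBB\xBF// instead. Note that chr(...) is not evaluated inside a regex, and \x{EFBBBF} means one code point with that number, not three bytes.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a UTF-8 text file and drop a leading BOM.
sub read_utf8_without_bom {
    my ($path) = @_;
    open(my $fh, '<:encoding(UTF-8)', $path)
        or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close($fh);
    # After decoding, the bytes EF BB BF have become the character U+FEFF.
    $lines[0] =~ s/^\x{FEFF}// if @lines;
    return @lines;
}

# Demo: write a small file the way Notepad would, BOM bytes first.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
binmode($tmp_fh);                              # raw bytes, no layers
print {$tmp_fh} "\xEF\xBB\xBF", "hello\n";
close($tmp_fh);

my @lines = read_utf8_without_bom($tmp_path);
print $lines[0];
```

The substitution is applied to the first line only, since a BOM is only meaningful at the very start of the file.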
Re^9: UTF-8 text files with Byte Order Mark
by Anonymous Monk on May 29, 2013 at 08:19 UTC

    I'm trying my best to understand this thread, but I'm having difficulty.

    Please stop trying, there is nothing for you here. Read Tutorials/perlunitut: Unicode in Perl and perlunitut, then use :via(File::BOM)

    I've tried deleting it using all these commands, none of which works:

    Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you

    If you've got raw data you want to share you can use

    perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

    The different ways BOM can look

    $ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
    {
      # tied Readonly::Hash
      "iso-10646-1" => "\xFE\xFF",
      "UCS-2"       => "\xFE\xFF",
      "UTF-16BE"    => "\xFE\xFF",
      "UTF-16LE"    => "\xFF\xFE",
      "UTF-32BE"    => "\0\0\xFE\xFF",
      "UTF-32LE"    => "\xFF\xFE\0\0",
      "UTF-8"       => "\xEF\xBB\xBF",
      "utf8"        => "\xEF\xBB\xBF",
    }
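For anyone who wants to see how those byte sequences can be used without installing anything, here is a rough core-Perl sketch. The %bom_for hash is copied by hand from the table above, and detect_bom is a hypothetical helper, not part of File::BOM. It works on raw, undecoded bytes and tries the longer BOMs first, because the UTF-32LE BOM starts with the same two bytes as the UTF-16LE one.

```perl
use strict;
use warnings;

# BOM byte sequences, copied from the table above (one name per encoding).
my %bom_for = (
    'UTF-8'    => "\xEF\xBB\xBF",
    'UTF-16BE' => "\xFE\xFF",
    'UTF-16LE' => "\xFF\xFE",
    'UTF-32BE' => "\0\0\xFE\xFF",
    'UTF-32LE' => "\xFF\xFE\0\0",
);

# Hypothetical helper: return the name of the encoding whose BOM starts
# $bytes, or undef when no known BOM is present.
sub detect_bom {
    my ($bytes) = @_;
    # Longest BOM first, then by name so the order is deterministic.
    my @candidates = sort {
        length($bom_for{$b}) <=> length($bom_for{$a}) || $a cmp $b
    } keys %bom_for;
    for my $enc (@candidates) {
        return $enc if index($bytes, $bom_for{$enc}) == 0;
    }
    return;
}

for my $sample ("\xEF\xBB\xBFhello", "\xFF\xFE\0\0", "hello") {
    my $enc = detect_bom($sample);
    printf "%s\n", defined $enc ? $enc : 'none';
}
```

This only guesses from the first few bytes. UTF-16LE text that begins with a NUL character after its BOM would be reported as UTF-32LE, so treat the result as a hint.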