PerlMonks  

Re^8: UTF-8 text files with Byte Order Mark

by silentq (Novice)
on May 27, 2013 at 13:44 UTC (#1035409)


in reply to Re^7: UTF-8 text files with Byte Order Mark
in thread UTF-8 text files with Byte Order Mark

I'm trying my best to understand this thread, but I'm having difficulty.
I'm dealing with the same issue where Notepad seems to add the BOM to the beginning of UTF-8 files. I've tried deleting it using all these commands, none of which works:

s/chr(0xEFBBBF)//; #remove Byte Order Mark
s/\x{EFBBBF}//;
s/^chr(0xFEFF)//;
s/^\x{FEFF}//;

Another clue: When I was using Strawberry Perl, I was able to use \x{064E} to refer to an Arabic vowel marker, and that worked. But now I'm using ActiveState, and that no longer works.
But I haven't been able to reference the BOM using either Strawberry or ActiveState. So I'm wondering if there's some sort of package I need to reference in order to make Perl recognize the \x{NNNN} format. Any suggestions?
Thanks,


Re^9: UTF-8 text files with Byte Order Mark
by Anonymous Monk on May 29, 2013 at 08:19 UTC

    I'm trying my best to understand this thread, but I'm having difficulty.

    Please stop trying; there is nothing for you here. Read "perlunitut: Unicode in Perl" (under Tutorials) and perlunitut, and use :via(File::BOM).

    I've tried deleting it using all these commands, none of which works:

    Please stop that :) Read perlunitut and use :via(File::BOM); it will decode your file and remove the BOM for you.

    If you've got raw data you want to share you can use

    perl -MData::Dump -MFile::Slurp -e " dd scalar read_file shift, { qw/ binmode :raw / }; " AnyKindOfInputFile > ThatFilesBytesAsPerlAsciiCode.pl

    The different ways BOM can look

    $ perl -MFile::BOM -MData::Dump -e " dd \%File::BOM::enc2bom "
    {
      # tied Readonly::Hash
      "iso-10646-1" => "\xFE\xFF",
      "UCS-2"       => "\xFE\xFF",
      "UTF-16BE"    => "\xFE\xFF",
      "UTF-16LE"    => "\xFF\xFE",
      "UTF-32BE"    => "\0\0\xFE\xFF",
      "UTF-32LE"    => "\xFF\xFE\0\0",
      "UTF-8"       => "\xEF\xBB\xBF",
      "utf8"        => "\xEF\xBB\xBF",
    }
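
To make the listing above concrete, here is a minimal core-Perl sketch that checks raw bytes against those BOM sequences. The helper name `sniff_bom` and the `@BOMS` table are hypothetical, written for illustration only; this is not how File::BOM itself is implemented.

```perl
use strict;
use warnings;

# BOM byte sequences copied from the listing above, longest first so that
# UTF-32LE (FF FE 00 00) is tested before UTF-16LE (FF FE).
my @BOMS = (
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'    => "\xEF\xBB\xBF"     ],
    [ 'UTF-16BE' => "\xFE\xFF"         ],
    [ 'UTF-16LE' => "\xFF\xFE"         ],
);

# Hypothetical helper: returns (encoding name, BOM length in bytes),
# or an empty list when the bytes do not start with a known BOM.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $bom) = @$entry;
        return ($name, length $bom) if index($bytes, $bom) == 0;
    }
    return;
}

my ($enc, $len) = sniff_bom("\xEF\xBB\xBFhello");
print "$enc, $len bytes\n";   # prints "UTF-8, 3 bytes"
```

The input must be undecoded bytes (for example, read with binmode :raw), because the table holds byte sequences rather than characters.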
Re^9: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on May 29, 2013 at 20:18 UTC
    The last one (s/^\x{FEFF}//) is the correct one. It will remove the BOM once the text has been decoded.
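
A minimal sketch of why decoding matters, assuming the data is UTF-8 and using only the core Encode module. The helper name `strip_bom` is hypothetical. Before decoding, the BOM is three separate bytes (EF BB BF), so a pattern looking for the single character U+FEFF cannot match; after decoding, those bytes have become that one character.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: remove a leading U+FEFF from already-decoded text.
sub strip_bom {
    my ($text) = @_;
    $text =~ s/^\x{FEFF}//;
    return $text;
}

# Raw bytes as Notepad would write them: UTF-8 BOM (EF BB BF), then "hi".
my $bytes = "\xEF\xBB\xBFhi\n";

# On undecoded bytes the substitution finds nothing to remove.
my $still_raw = strip_bom($bytes);

# After decoding, the three BOM bytes are the single character U+FEFF,
# which the same substitution removes.
my $clean = strip_bom(decode('UTF-8', $bytes));

printf "raw length: %d, clean length: %d\n",
    length $still_raw, length $clean;   # prints "raw length: 6, clean length: 3"
```

When reading from a file, opening it with the '<:encoding(UTF-8)' layer performs the same decoding step, after which the substitution can be applied to the first line read.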
