
Re: UTF-8 text files with Byte Order Mark

by Joost (Canon)
on Feb 13, 2007 at 18:01 UTC

in reply to UTF-8 text files with Byte Order Mark

A BOM is part of the text, and it is a (sort of) valid character: U+FEFF, "ZERO WIDTH NO-BREAK SPACE". Your best bet is just to strip it off, since its use (aside from serving as a BOM) isn't recommended anyway:
while (my $line = <>) {
    $line =~ s/^\x{FEFF}//;  # strip BOM
    # rest
}
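
Note that the substitution can only see U+FEFF if the input has already been decoded to characters. Here is a minimal sketch of that, assuming the data is UTF-8 and is read through an :encoding(UTF-8) layer. The helper name strip_bom and the in-memory sample data are made up for illustration; a real script would open its own file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: remove a leading U+FEFF from an already-decoded string.
sub strip_bom {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;
    return $line;
}

# Sample bytes standing in for a real file: the UTF-8 BOM, then two lines.
my $bytes = "\xEF\xBB\xBFfirst line\nsecond line\n";

# Read through a decoding layer so the three BOM bytes become one character.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "Cannot open: $!";

my @lines;
while (my $line = <$fh>) {
    $line = strip_bom($line) if $. == 1;   # a BOM can only sit at the very start
    chomp $line;
    push @lines, $line;
}
close $fh;

print "$_\n" for @lines;
```

Restricting the substitution to the first line is optional; it just avoids touching a U+FEFF that legitimately starts a later line.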

Re^2: UTF-8 text files with Byte Order Mark
by muba (Priest) on Feb 13, 2007 at 20:21 UTC

    Yeah, this works, except that the BOM is indeed a three-byte sequence, as said above. So the code, which seems to work, now looks like this:

    while (my $line = <$rulesFH>) {
        if ($. == 1) {
            # Remove Byte Order Mark if it's there
            use Encode;
            my $octets = encode("utf8", $line);
            $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
            $line = decode("utf8", $octets);
        }
        # rest...
    }
      my $octets = encode("utf8", $line);
      $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
      $line = decode("utf8", $octets);

      is the same thing as

      my $BOM = decode("utf8", "\x{ef}\x{bb}\x{bf}"); $line =~ s/^$BOM//;

      is the same thing as

      my $BOM = chr(0xFEFF); $line =~ s/^$BOM//;

      is the same thing as

      $line =~ s/^\x{FEFF}//;

      which is what I gave you. Much simpler!
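
      For anyone who wants to check the chain of equivalences above, here is a rough sketch that applies all four variants to the same decoded line and collects the results. The sample string and the variable names are invented for the demonstration; it assumes $line holds characters, not raw bytes.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two ways of building the BOM character; they should be the same string.
my $bom_from_bytes = decode('UTF-8', "\xEF\xBB\xBF");
my $bom_from_chr   = chr(0xFEFF);
my $same_char      = $bom_from_bytes eq $bom_from_chr ? 1 : 0;

# A decoded sample line that starts with U+FEFF (made-up content).
my $decoded = "\x{FEFF}rule one";
my @results;

# Variant 1: encode to octets, strip the three bytes, decode again.
{
    my $octets = encode('UTF-8', $decoded);
    $octets =~ s/^\xEF\xBB\xBF//;
    push @results, decode('UTF-8', $octets);
}

# Variant 2: strip using the BOM decoded from its UTF-8 bytes.
{ my $line = $decoded; $line =~ s/^$bom_from_bytes//; push @results, $line; }

# Variant 3: strip using chr(0xFEFF).
{ my $line = $decoded; $line =~ s/^$bom_from_chr//; push @results, $line; }

# Variant 4: strip using the \x{FEFF} escape directly.
{ my $line = $decoded; $line =~ s/^\x{FEFF}//; push @results, $line; }

print "same character: $same_char\n";
print "variant ", $_ + 1, ": $results[$_]\n" for 0 .. $#results;
```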

        Meh. Indeed, I didn't realise that. Thank you!

        Thank you!!! This saved me a lot of trouble. I am also trying to strip out these UTF-8 byte order mark characters, which Google Docs adds by default to downloaded text files. By the way, I found that \x{FEFF} was not the same as \x{ef}\x{bb}\x{bf}.
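
        One likely explanation for that last observation, assuming the file was read without a decoding layer: on raw bytes only the three-byte pattern matches, while on decoded text only \x{FEFF} matches. A small sketch with made-up sample data and variable names:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Undecoded bytes, as they would arrive from a handle with no encoding layer.
my $raw = "\xEF\xBB\xBFhello";

my $char_pattern_hits_raw = ($raw =~ /^\x{FEFF}/)     ? 1 : 0;
my $byte_pattern_hits_raw = ($raw =~ /^\xEF\xBB\xBF/) ? 1 : 0;

# The same data after decoding: the three bytes collapse into one character.
my $text = decode('UTF-8', $raw);

my $char_pattern_hits_text = ($text =~ /^\x{FEFF}/)     ? 1 : 0;
my $byte_pattern_hits_text = ($text =~ /^\xEF\xBB\xBF/) ? 1 : 0;

print "raw bytes:    char pattern=$char_pattern_hits_raw byte pattern=$byte_pattern_hits_raw\n";
print "decoded text: char pattern=$char_pattern_hits_text byte pattern=$byte_pattern_hits_text\n";
```

        So which pattern to use depends on whether the string in hand is bytes or characters, not on the file itself.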
