Re: UTF-8 text files with Byte Order Mark
by almut (Canon) on Feb 13, 2007 at 17:50 UTC
Actually, I would be a little surprised to find a BOM in
combination with UTF-8 (as the encoding is just a sequence of
bytes). Normally, you'd find BOMs with the "ucs-2" encodings,
as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.
Anyway, what you could try is something like this (not sure if this
is the most elegant way, but it should work... Update: it isn't :) - apparently there's File::BOM)
sub openfile_unicode {
    my $filename = shift;
    open my $fh, "<:raw", $filename or die "Cannot open $filename: $!\n";
    my $bom;
    read $fh, $bom, 2;
    if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") { # BOM present?
        # if so, determine if little- or big-endian
        my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le":"be");
        binmode $fh, ":encoding($encoding)";
    } else { # otherwise assume UTF-8
        # reopen file
        close $fh;
        $fh = undef;
        open $fh, "<:encoding(utf8)", $filename or die "Cannot open $filename: $!\n";
    }
    return $fh;
}

my $fh = openfile_unicode("somefile");
while (my $line = <$fh>) {
    # ...
}
The test file seems to match that three-byte BOM indeed.
I'm happy to know you don't usually see utf-8 files with a BOM, but as pointed out below, some programs, such as Notepad, still store it. One of my users seems to have a utf-8 file with a BOM too.
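For a file like that, a minimal sketch of a byte-level check might look as follows. The name open_utf8_skip_bom is hypothetical (it is not from this thread or from any module), and the sketch assumes the file is UTF-8 whether or not it starts with the three bytes EF BB BF:

```perl
use strict;
use warnings;

# Hypothetical helper: open a file as UTF-8 and skip a leading
# three-byte BOM (EF BB BF) if one is present.
sub open_utf8_skip_bom {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Look at the first three bytes while the handle is still in raw mode.
    my $head = '';
    my $got = read $fh, $head, 3;
    die "Cannot read $filename: $!\n" unless defined $got;

    # No BOM: rewind so the bytes just read are not lost.
    if ($head ne "\xEF\xBB\xBF") {
        seek $fh, 0, 0 or die "Cannot rewind $filename: $!\n";
    }

    # Decode everything from the current position onward as UTF-8.
    binmode $fh, ':encoding(UTF-8)';
    return $fh;
}
```

Called as open_utf8_skip_bom("somefile"), it returns a handle that can be read line by line as usual; the first line then no longer starts with the BOM character.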
File::BOM does the same thing (and does it better?)
Many text editors use a BOM to distinguish ASCII or a local encoding from UTF
Re: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 18:01 UTC
A BOM is part of the text, and it's a (sort of) valid character, "ZERO WIDTH NON-BREAKING SPACE". Your best bet is just to strip it off, since its use (aside from providing a BOM) isn't recommended anyway:
while (my $line = <>) {
    $line =~ s/^\x{FEFF}//; # strip BOM
    # rest
}
Yeah, this works, except that the BOM is indeed a three-byte thing, as said above. So the code, which seems to work, now looks like this:
while (my $line = <$rulesFH>) {
    if ($. == 1) {
        # Remove Byte Order Mark if it's there
        use Encode;
        my $octets = encode("utf8", $line);
        $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
        $line = decode("utf8", $octets);
    }
    # rest...
}
my $octets = encode("utf8", $line);
$octets =~ s/^\x{ef}\x{bb}\x{bf}//;
$line = decode("utf8", $octets);
is the same thing as
my $BOM = decode("utf8", "\x{ef}\x{bb}\x{bf}");
$line =~ s/^$BOM//;
is the same thing as
my $BOM = chr(0xFEFF);
$line =~ s/^$BOM//;
is the same thing as
$line =~ s/^\x{FEFF}//;
which is what I gave you. Much simpler!
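The equivalence rests on the three octets EF BB BF being the UTF-8 encoding of the single character U+FEFF. A small sketch to check that, using only the core Encode module; the helper name strip_bom is made up for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The octet form of the BOM, as it appears on disk in a UTF-8 file.
my $bom_octets = "\xEF\xBB\xBF";

# Decoding the three octets gives one character, U+FEFF, and
# encoding that character gives the same three octets back.
print "decoded BOM is U+FEFF\n"
    if decode('UTF-8', $bom_octets) eq chr(0xFEFF);
print "encoded U+FEFF is EF BB BF\n"
    if encode('UTF-8', chr(0xFEFF)) eq $bom_octets;

# Hypothetical helper: strip a leading BOM from an already-decoded line.
sub strip_bom {
    my ($line) = @_;
    $line =~ s/^\x{FEFF}//;
    return $line;
}
```

So once the line has been decoded (for example by an :encoding layer on the filehandle), matching the character \x{FEFF} is enough and no encode/decode round trip is needed.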
Re: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on Feb 13, 2007 at 17:55 UTC
so I kinda assume that Perl will handle with this kind of stuff for me.
Having Perl remove the BOM automatically would be bad. print while <$fh>; would no longer print out a file exactly, for example. It wouldn't be possible to print out a file exactly by other means either.
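To see what Perl actually does here, the following sketch reads a BOM-prefixed octet string through the UTF-8 layer via an in-memory filehandle. The helper name first_line_via_utf8_layer and the sample data are invented for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical helper: read the first line of an octet string
# through the :encoding(UTF-8) layer, using an in-memory file.
sub first_line_via_utf8_layer {
    my ($octets) = @_;
    open my $fh, '<:encoding(UTF-8)', \$octets
        or die "Cannot open in-memory file: $!\n";
    my $line = <$fh>;
    close $fh;
    return $line;
}

# The layer decodes the BOM like any other character; it does not remove it.
my $line = first_line_via_utf8_layer("\xEF\xBB\xBFkey=value\n");
printf "first character: U+%04X\n", ord $line;
```

The first character of the line is U+FEFF, which is why a parser that expects the line to start with real content trips over it.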
However, if file contains that BOM, my program does not understand the first line in the file
Patient: "Doctor, it hurts when I do this."
Doctor: "So don't do it!"
If your program doesn't accept BOMs, don't feed it any. BOMs are not required.
Alternatively, you could change your spec and your program to accept it.
while (<$fh>) {
    s/\x{FEFF}//g;
    ...
}
Patient: "Doctor, it hurts when I do this."
Doctor: "So don't do it!"
Easy to say, of course, but what if the program one of my users uses stores that BOM anyway? Besides, as pointed out, BOMs in a utf-8 file *are* valid, so I feel I should support them. Look, if the user were toying around with malformed files I'd be more than happy to tell him to get that fixed :D but apparently he's doing what he rightly thinks is right.
BOMs in a utf-8 file *are* valid
"!" in an ASCII file is also valid. But if you place a "!" at the start of your Perl program, it probably will not compile. It is a malformed file, not from a Unicode perspective, but from your parser's perspective.
I provided two alternatives (removing the BOM and File::BOM) that will work with your broken tools (i.e. tools that add an undesirable character to the files you edit). I'd go with them since allowing the BOM is surely a good thing.
Re: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Dec 08, 2022 at 08:45 UTC
I agree that this should be done automatically if the UTF-8 IO layer is specified. The fact that UTF-8 files with a BOM are rare makes this more important. I'm willing to bet that there are many Perl scripts out there that read UTF-8 files and that will break the first time they encounter a file with a BOM.
Re: UTF-8 text files with Byte Order Mark
by freonpsandoz (Beadle) on Sep 19, 2016 at 03:57 UTC
If your program doesn't accept BOMs, don't feed it any. BOMs are not required.
BOMs are required in some types of UTF-8 files. Try loading a UTF-8 cue sheet or m3u8 playlist without a BOM into Foobar2000 sometime...