Actually, I would be a little surprised to find a BOM in
combination with UTF-8, as that encoding is just a sequence of
bytes with no byte order to mark (though some Windows editors do
write one anyway). Normally, you'd find BOMs with the "ucs-2"
(UTF-16) encodings, as used by Windows in many places. With those,
each character is stored as a 16-bit value, and thus the internal
byte ordering matters.
Anyway, what you could try is something like this (not sure if this
is the most elegant way, but it should work... Update: it isn't :)
Apparently there's File::BOM on CPAN.)
sub openfile_unicode {
    my $filename = shift;

    # open in raw mode first so the BOM bytes can be inspected as-is
    open my $fh, "<:raw", $filename or die "Cannot open $filename: $!\n";

    my $bom;
    read $fh, $bom, 2;

    if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {    # BOM present?
        # if so, determine if little- or big-endian
        my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
        # the two BOM bytes are already consumed, so just switch layers
        binmode $fh, ":encoding($encoding)";
    } else {    # otherwise assume UTF-8
        # reopen file so the two bytes read above are not lost
        close $fh;
        $fh = undef;
        open $fh, "<:encoding(utf8)", $filename or die "Cannot open $filename: $!\n";
    }

    return $fh;
}
my $fh = openfile_unicode("somefile");
while (my $line = <$fh>) {
    # ...
}
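
In case a file does turn up with a UTF-8 BOM (the bytes EF BB BF), here is a rough variant of the same idea that recognises it as well and uses seek rather than reopening the file. This is only a sketch using core modules: the name open_with_bom_sniff is made up for illustration, I've assumed UTF-16 rather than plain UCS-2 for the two-byte case, and it does not try to tell a UTF-32LE BOM (FF FE 00 00) apart from a UTF-16LE one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of the code above): sniff up to three
# bytes, choose an encoding layer, then seek to just past the BOM.
sub open_with_bom_sniff {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    my $head = '';
    my $got  = read $fh, $head, 3;
    die "Cannot read $filename: $!\n" unless defined $got;

    # Default assumption: UTF-8 without a BOM, nothing to skip.
    my ($encoding, $skip) = ('UTF-8', 0);
    if    ($head =~ /^\xEF\xBB\xBF/) { $skip = 3 }                            # UTF-8 BOM
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $skip) = ('UTF-16LE', 2) } # little-endian
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $skip) = ('UTF-16BE', 2) } # big-endian

    # Rewind to the first byte after the BOM (or to the start if none).
    seek $fh, $skip, 0 or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";
    return $fh;
}

# Small demonstration: write "hi\n" as UTF-16LE with a BOM, read it back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp, ':raw';
print {$tmp} "\xFF\xFE", "h\x00i\x00\n\x00";
close $tmp;

my $in   = open_with_bom_sniff($path);
my $line = <$in>;
chomp $line;
print "$line\n";
```

Seeking avoids the close/reopen dance, but it only works on seekable files; for pipes or sockets you would have to keep the bytes already read and decode them yourself.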