in reply to UTF-8 text files with Byte Order Mark
Actually, I would be a little surprised to find a BOM in combination with UTF-8, as that encoding is just a sequence of bytes and has no byte order to mark. Normally, you'd find BOMs with the 16-bit encodings (UCS-2/UTF-16), as used by Windows in many places. With those, each code unit is a 16-bit value, and thus the byte ordering within it matters.
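To make the byte-order point concrete, here is a small sketch that prints how the BOM character (U+FEFF) is serialized in a few encodings. It only uses the core Encode module; the helper name `bom_hex` is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of U+FEFF as serialized in a given encoding.
sub bom_hex {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "\x{FEFF}");
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# One line per encoding: name, then the bytes a reader would see at file start.
printf "%-8s %s\n", $_, bom_hex($_) for qw(UTF-16LE UTF-16BE UTF-8);
```

The two UTF-16 variants come out as `FF FE` and `FE FF`, which is exactly what a reader uses to tell them apart. The UTF-8 form is always `EF BB BF`, so there it carries no ordering information, although some tools apparently write it anyway as a signature.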
Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work... Update: it isn't :) - apparently there's File::BOM)
    sub openfile_unicode {
        my $filename = shift;
        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";
        my $bom;
        read $fh, $bom, 2;
        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {  # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        }
        else {  # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
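Note that the snippet above falls back to UTF-8 without skipping a UTF-8 BOM, so if the file does start with `EF BB BF`, the first line would begin with a stray U+FEFF. A possible variation, as a rough sketch only: sniff up to three bytes, then seek past whatever BOM was found before pushing the decoding layer. The function name `open_with_bom_sniff` is hypothetical, and treating "no BOM" as UTF-8 is an assumption carried over from the code above. It uses core Perl only; File::BOM is probably the tidier choice if it is installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the code above): open a text file and pick
# the decoding layer from its first bytes, including the UTF-8 BOM.
sub open_with_bom_sniff {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Read up to 3 bytes; shorter files simply yield fewer.
    my $got = read $fh, my $head, 3;
    die "Cannot read $filename: $!\n" unless defined $got;

    my ($encoding, $bom_len) =
          $head =~ /^\xEF\xBB\xBF/ ? ('UTF-8',    3)
        : $head =~ /^\xFF\xFE/     ? ('UTF-16LE', 2)
        : $head =~ /^\xFE\xFF/     ? ('UTF-16BE', 2)
        :                            ('UTF-8',    0);  # assumption: no BOM means UTF-8

    # Rewind to just past the BOM (or to the start), then decode from there.
    seek $fh, $bom_len, 0 or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";
    return $fh;
}
```

Used the same way as before (`my $fh = open_with_bom_sniff("somefile");`), lines read from the handle should come back decoded and without the BOM. Seeking instead of reopening also avoids opening the file twice.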