http://www.perlmonks.org?node_id=599729


in reply to UTF-8 text files with Byte Order Mark

Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes). Normally, you'd find BOMs with the "ucs-2" encodings, as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.

Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work...   Update: it isn't :) - apparently there's File::BOM)

    sub openfile_unicode {
        my $filename = shift;
        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";
        my $bom;
        read $fh, $bom, 2;
        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {  # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        } else {  # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
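If the file does start with the three-byte UTF-8 BOM (EF BB BF), the function above would hand that BOM back as the first character of the first line. Here is a rough sketch of a variant that checks for it as well, using only core Perl. The name open_with_bom is made up for illustration, and I have swapped in UTF-16LE/BE for ucs-2 so that surrogate pairs decode; both are assumptions on my part, not part of the code above. It reads up to three raw bytes, picks an encoding, seeks just past the BOM, and only then pushes the decoding layer.

```perl
use strict;
use warnings;

# Hypothetical helper (name is an assumption): open a text file, choose the
# decoding layer from its BOM, and fall back to UTF-8 when there is none.
sub open_with_bom {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Read up to three bytes; a shorter file simply yields fewer.
    my $head = '';
    read $fh, $head, 3;

    my ($encoding, $bom_len) = ('UTF-8', 0);    # assumed default: UTF-8, no BOM
    if ($head =~ /^\xEF\xBB\xBF/) {             # UTF-8 BOM
        $bom_len = 3;
    } elsif ($head =~ /^\xFF\xFE/) {            # little-endian 16-bit BOM
        ($encoding, $bom_len) = ('UTF-16LE', 2);
    } elsif ($head =~ /^\xFE\xFF/) {            # big-endian 16-bit BOM
        ($encoding, $bom_len) = ('UTF-16BE', 2);
    }

    # Position just past the BOM (or back at the start), then decode from there.
    seek $fh, $bom_len, 0
        or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";
    return $fh;
}
```

It is called the same way as openfile_unicode. Note that it does not try to tell a UTF-32LE BOM (FF FE 00 00) apart from a 16-bit little-endian one; File::BOM is probably the better choice if that matters.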

Re^2: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 17:53 UTC

      The test file seems to match that three-byte BOM indeed.

      I'm happy to know you don't usually see UTF-8 files with a BOM, but as pointed out below, some programs, such as Notepad, still store it. One of my users seems to have a UTF-8 file with a BOM too.

Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on Feb 13, 2007 at 18:08 UTC
    File::BOM does the same thing (and does it better?)
Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Patriarch) on Feb 13, 2007 at 18:05 UTC

    Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes).

    Notepad adds a BOM when you save as UTF-8.

Re^2: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Mar 18, 2010 at 06:37 UTC
    Many text editors use a BOM to distinguish ASCII or a local encoding from UTF.