muba has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on this program that reads two UTF-8 files. One contains a lexicon, the other provides a number of sound change rules. In the end, the program is to apply those rules to the original lexicon and output the sound-changed words.

So far so good. Except that things fail if the BOM (byte order mark) is present in the ruleset file.

I open the file with open my $lexFH, "<:encoding(UTF-8)", $clarg{l} or die "Couldn't open lexicon file $clarg{l}: $!"; so I kinda assumed that Perl would deal with this kind of stuff for me.

However, if the file contains that BOM, my program does not understand the first line in the file. Ok, so I understand the complete details of why my program has trouble with the line, and in the end it just boils down to the simple fact that it doesn't expect that BOM.

And neither did I. I had hoped that Perl would understand it as part of the utf-8 encoding.

By the way, I read my lines as while (my $line = <$lexFH>) {.

So. The actual question I'm trying to ask is this: how do I make Perl understand the BOM in a way that my program never sees it?

Re: UTF-8 text files with Byte Order Mark
by almut (Canon) on Feb 13, 2007 at 17:50 UTC

    Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes). Normally, you'd find BOMs with the "ucs-2" encodings, as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.

    Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work...   Update: it isn't :) - apparently there's File::BOM)

    sub openfile_unicode {
        my $filename = shift;
        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";
        my $bom;
        read $fh, $bom, 2;
        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {   # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        } else {
            # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
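The sniffing approach above only looks at two bytes, so it never sees the three-byte UTF-8 BOM (EF BB BF). A minimal sketch of a variant that also recognises it follows. The helper name open_skipping_bom is hypothetical, the fallback to BOM-less UTF-8 is an assumption, and a UTF-32LE file (whose BOM starts with the same two bytes as UTF-16LE) would be misdetected.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open a text file and consume a leading BOM, if any.
# Recognises the UTF-8 BOM and the two UTF-16 BOMs; anything else is
# assumed (not detected) to be UTF-8 without a BOM.
sub open_skipping_bom {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Read up to three bytes; a short file simply yields fewer.
    my $head = '';
    read $fh, $head, 3;

    my ($encoding, $bom_length) = ('UTF-8', 0);
    if    ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_length) = ('UTF-8',    3) }
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_length) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_length) = ('UTF-16BE', 2) }

    # Position the handle just past the BOM (or at the start), then decode.
    seek $fh, $bom_length, 0
        or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";
    return $fh;
}

# Usage: write a small UTF-8 file with a BOM, then read it back.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "\xEF\xBB\xBFfirst rule\nsecond rule\n";
close $tmp or die "Cannot close $path: $!\n";

my $fh = open_skipping_bom($path);
my $first = <$fh>;
close $fh;
print $first;
```

Seeking back instead of reopening keeps a single handle, and the caller never sees the BOM regardless of which of the three encodings was found.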

        The test file seems to match that three-byte BOM indeed.

        I'm happy to know you don't usually see utf-8 files with a BOM, but as pointed out below, some programs still store it, such as Notepad. One of my users seems to have a utf-8 file with a BOM too.

      Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes).

      Notepad adds a BOM when you save as UTF-8.

      File::BOM does the same thing (and does it better?)
      Many text editors use BOM to distinguish ASCII or local-encoding from UTF
Re: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on Feb 13, 2007 at 17:55 UTC

    so I kinda assumed that Perl would deal with this kind of stuff for me.

    Having Perl remove the BOM automatically would be bad. print while <$fh>; would no longer print out a file exactly, for example. It wouldn't be possible to print out a file exactly by other means either.
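To make concrete what the program actually receives, here is a small sketch using the core Encode module. It decodes a byte string that starts with the UTF-8 BOM and shows that the BOM survives decoding as the ordinary character U+FEFF. The sample text is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three BOM bytes, followed by some illustrative text.
my $bytes = "\xEF\xBB\xBFfoo\n";

# Decoding as UTF-8 keeps the BOM as a regular character.
my $text = decode('UTF-8', $bytes);

printf "first code point: U+%04X\n", ord $text;    # prints U+FEFF
printf "length in chars:  %d\n",     length $text; # prints 5
```

So a parser that expects the first line to begin with a rule sees one extra, invisible character in front of it.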

    However, if the file contains that BOM, my program does not understand the first line in the file

    Patient: "Doctor, it hurts when I do this."
    Doctor: "So don't do it!"

    If your program doesn't accept BOMs, don't feed it any. BOMs are not required.

    Alternatively, you could change your spec and your program to accept it.

    while (<$fh>) { s/\x{FEFF}//g; ... }
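The loop above removes U+FEFF wherever it occurs on any line. If only a leading BOM should go, a slightly narrower variant anchors the substitution and applies it to the first line only. This is a sketch under that assumption, using an in-memory file and made-up rule text so that it runs on its own.

```perl
use strict;
use warnings;

# In-memory "file": UTF-8 BOM followed by two illustrative rule lines.
my $bytes = "\xEF\xBB\xBFa > b\nc > d\n";
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory file: $!\n";

my @rules;
while (my $line = <$fh>) {
    # $. is the current line number; strip the BOM only at the very start.
    $line =~ s/^\x{FEFF}// if $. == 1;
    chomp $line;
    push @rules, $line;
}
close $fh;

print "$_\n" for @rules;
```

Either form works for the BOM case; the anchored one leaves any U+FEFF elsewhere in the file alone.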
      Patient: "Doctor, it hurts when I do this."
      Doctor: "So don't do it!"

      Easy to say, of course, but what if the program one of my users uses stores that BOM anyway? Besides, as pointed out, a BOM in a utf-8 file *is* valid, so I feel I should support it. Look, if the user was toying around with malformed files I'd be more than happy to tell him to get that fixed :D but apparently he's doing what he rightly thinks is right.

        a BOM in a utf-8 file *is* valid

        "!" in an ASCII file is also valid. But if you place a "!" at the start of your Perl program, it probably will not compile. It is a malformed file, not from a Unicode perspective, but from your parser's perspective.

        I provided two alternatives (removing the BOM and File::BOM) that will work with your broken tools (i.e. tools that add undesirable characters to the files you edit). I'd go with them since allowing the BOM is surely a good thing.

Re: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 18:01 UTC

      Yeah, this works, except that the BOM is indeed a three-byte thing, as said above. So the code, which seems to work, now looks like this:

      while (my $line = <$rulesFH>) {
          if ($. == 1) {
              # Remove Byte Order Mark if it's there
              use Encode;
              my $octets = encode("utf8", $line);
              $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
              $line = decode("utf8", $octets);
          }
          # rest...
      }
        my $octets = encode("utf8", $line);
        $octets =~ s/^\x{ef}\x{bb}\x{bf}//;
        $line = decode("utf8", $octets);

        is the same thing as

        my $BOM = decode("utf8", "\x{ef}\x{bb}\x{bf}"); $line =~ s/^$BOM//;

        is the same thing as

        my $BOM = chr(0xFEFF); $line =~ s/^$BOM//;

        is the same thing as

        $line =~ s/^\x{FEFF}//;

        which is what I gave you. Much simpler!
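The chain of equivalences above can be checked directly. The following sketch runs both the octet round trip and the character-level substitution on the same made-up line and compares the results.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# An illustrative decoded line that starts with the BOM character.
my $original = "\x{FEFF}p > b\n";

# Round trip through octets, as in the longer version.
my $octets = encode('utf8', $original);
$octets =~ s/^\xEF\xBB\xBF//;
my $via_octets = decode('utf8', $octets);

# Character-level substitution, as in the one-liner.
(my $via_chars = $original) =~ s/^\x{FEFF}//;

print $via_octets eq $via_chars ? "same result\n" : "different result\n";
```

Because the handle already decodes UTF-8, the line holds characters rather than bytes, which is why matching the single character U+FEFF is enough.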