PerlMonks  

Re: UTF-8 text files with Byte Order Mark

by almut (Canon)
on Feb 13, 2007 at 17:50 UTC (#599729)


in reply to UTF-8 text files with Byte Order Mark

Actually, I would be a little surprised to find a BOM in combination with UTF-8, since UTF-8 is defined as a sequence of bytes and so has no byte-order ambiguity to resolve. Normally, you'd find BOMs with the "ucs-2" (16-bit) encodings, as used by Windows in many places. With those, each char is stored as a 16-bit value, and thus the internal byte ordering matters.
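To make the byte-order point concrete, here is a small sketch (not part of the original post) that encodes the BOM character U+FEFF with the core Encode module and prints the resulting bytes. The variable names are purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF is the byte order mark; encoding it shows the bytes each
# encoding would put at the start of a file.
my $bom_char  = "\x{FEFF}";
my %bom_bytes = map { ($_ => encode($_, $bom_char)) } qw(UTF-16LE UTF-16BE UTF-8);

for my $name (sort keys %bom_bytes) {
    my $hex = join ' ', map { sprintf '%02X', ord } split //, $bom_bytes{$name};
    printf "%-8s %s\n", $name, $hex;
}
```

The two 16-bit forms come out as FF FE (little-endian) and FE FF (big-endian), which is what lets a reader tell the byte order apart. The UTF-8 form is always the same three bytes, EF BB BF, so there it can only act as a signature, not as an ordering hint.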

Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work...   Update: it isn't :) - apparently there's File::BOM)

    sub openfile_unicode {
        my $filename = shift;

        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";

        my $bom;
        read $fh, $bom, 2;

        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {  # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        } else {
            # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
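Note that the function above only looks for the two-byte marks, so a UTF-8 file that starts with the three-byte BOM (EF BB BF) goes through the "assume UTF-8" branch and keeps U+FEFF at the front of its first line. A possible variant that also skips the UTF-8 mark might look like the sketch below. It uses only core Perl; the name open_with_bom is hypothetical, and it assumes that a leading FF FE means UTF-16LE (it does not try to tell that apart from a UTF-32LE mark).

```perl
use strict;
use warnings;

# Hypothetical helper, not from the original post: choose a decoding
# layer from the BOM and position the handle just past it.
sub open_with_bom {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    # Read up to three bytes, the length of the longest mark checked here.
    my $head = '';
    read $fh, $head, 3;

    # Default: no BOM found, assume UTF-8 and start at byte 0.
    my ($encoding, $bom_len) = ('UTF-8', 0);
    if    ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_len) = ('UTF-8',    3) }
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_len) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_len) = ('UTF-16BE', 2) }

    # Rewind to just after the BOM (or to the start), then decode from there.
    seek $fh, $bom_len, 0
        or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";

    return $fh;
}
```

Seeking instead of reopening keeps a single handle for all three cases. For anything beyond a quick script, File::BOM (mentioned in the update above) is probably the better choice.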


Re^2: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 17:53 UTC

      The test file seems to match that three-byte BOM indeed.

      I'm happy to know you don't usually see UTF-8 files with a BOM, but as pointed out below, some programs, such as Notepad, still write one. One of my users seems to have a UTF-8 file with a BOM too.

Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on Feb 13, 2007 at 18:05 UTC

    Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes).

    Notepad adds a BOM when you save as UTF-8.

Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on Feb 13, 2007 at 18:08 UTC
    File::BOM does the same thing (and does it better?)
Re^2: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Mar 18, 2010 at 06:37 UTC
    Many text editors use BOM to distinguish ASCII or local-encoding from UTF
