PerlMonks  

Re: UTF-8 text files with Byte Order Mark

by almut (Canon)
on Feb 13, 2007 at 17:50 UTC


in reply to UTF-8 text files with Byte Order Mark

Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes). Normally, you'd find BOMs with the "ucs-2" encodings, as used by Windows in many places. With those, we have a 16-bit value per char, and thus the internal byte ordering matters.

Anyway, what you could try is something like this (not sure if this is the most elegant way, but it should work...). Update: it isn't :) Apparently there's File::BOM.

    sub openfile_unicode {
        my $filename = shift;

        open my $fh, "<:raw", $filename
            or die "Cannot open $filename: $!\n";

        my $bom;
        read $fh, $bom, 2;

        if ($bom eq "\xff\xfe" || $bom eq "\xfe\xff") {  # BOM present?
            # if so, determine if little- or big-endian
            my $encoding = "ucs-2" . ($bom eq "\xff\xfe" ? "le" : "be");
            binmode $fh, ":encoding($encoding)";
        } else {  # otherwise assume UTF-8
            # reopen file
            close $fh;
            $fh = undef;
            open $fh, "<:encoding(utf8)", $filename
                or die "Cannot open $filename: $!\n";
        }
        return $fh;
    }

    my $fh = openfile_unicode("somefile");
    while (my $line = <$fh>) {
        # ...
    }
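
Since the original question is about UTF-8 files that start with a BOM, here is a rough sketch of a variant that also recognises the three-byte UTF-8 BOM (EF BB BF) and skips it. The name open_with_bom_check is made up for illustration, only core Perl is used, and files without any BOM are simply assumed to be UTF-8, as in the snippet above. It does not try to tell a UTF-16LE BOM apart from a UTF-32LE one (which starts with the same two bytes), so treat it as a starting point rather than a replacement for File::BOM.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original post): sniff up to three
# bytes, pick an encoding from the BOM found, and skip the BOM itself.
sub open_with_bom_check {
    my ($filename) = @_;

    open my $fh, '<:raw', $filename
        or die "Cannot open $filename: $!\n";

    my $head = '';
    read $fh, $head, 3;    # may yield fewer than 3 bytes for tiny files

    my ($encoding, $bom_len);
    if    ($head =~ /^\xEF\xBB\xBF/) { ($encoding, $bom_len) = ('UTF-8',    3) }
    elsif ($head =~ /^\xFF\xFE/)     { ($encoding, $bom_len) = ('UTF-16LE', 2) }
    elsif ($head =~ /^\xFE\xFF/)     { ($encoding, $bom_len) = ('UTF-16BE', 2) }
    else                             { ($encoding, $bom_len) = ('UTF-8',    0) }  # assumption: no BOM means UTF-8

    # Rewind to just past the BOM (or to the start if there was none),
    # then decode everything read from here on.
    seek $fh, $bom_len, 0
        or die "Cannot seek in $filename: $!\n";
    binmode $fh, ":encoding($encoding)";

    return $fh;
}
```

It is called the same way as openfile_unicode, e.g. my $fh = open_with_bom_check("somefile"); and the lines read from $fh then come back decoded and without a leading U+FEFF.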

Replies are listed 'Best First'.
Re^2: UTF-8 text files with Byte Order Mark
by Joost (Canon) on Feb 13, 2007 at 17:53 UTC

      The test file seems to match that three-byte BOM indeed.

      I'm happy to know you don't usually see UTF-8 files with a BOM, but as pointed out below, some programs, such as Notepad, still store it. One of my users seems to have a UTF-8 file with a BOM too.

Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on Feb 13, 2007 at 18:08 UTC
    File::BOM does the same thing (and does it better?)
Re^2: UTF-8 text files with Byte Order Mark
by ikegami (Pope) on Feb 13, 2007 at 18:05 UTC

    Actually, I would be a little surprised to find a BOM in combination with UTF-8 (as the encoding is just a sequence of bytes).

    Notepad adds a BOM when you save as UTF-8.

Re^2: UTF-8 text files with Byte Order Mark
by Anonymous Monk on Mar 18, 2010 at 06:37 UTC
    Many text editors use a BOM to distinguish UTF-encoded files from ASCII or a local encoding.
