PerlMonks  

Remove BOM ?

by zod (Scribe)
on Nov 19, 2008 at 05:53 UTC ( #724474=perlquestion )
zod has asked for the wisdom of the Perl Monks concerning the following question:

Greetings all,

I want to remove a UTF-8 BOM from a .pl file. I can do it by using another Perl script to process the file, like so:

while ( my $line = <IN> ) {
    $line =~ s/^\x{FEFF}//;
    print OUT $line;
}
But why can't I just do it like this in vim when editing the file itself?
s/^\x{FEFF}//g
That doesn't match the BOM. I'm a rookie, so maybe I'm making a rookie mistake.

Thanks, zod

UPDATE: Just for the sake of completeness, in case anyone wonders about this later: the BOM is not considered part of the text of the file you are editing in Vim. Vim strips the BOM when the file is opened and sets the local 'bomb' option to remember that it must be added back when writing the file. So you can't pattern-match it, because it isn't there while you have the file open in Vim. See here.

Re: Remove BOM ?
by ikegami (Pope) on Nov 19, 2008 at 06:04 UTC

    A few possibilities.

    • vim treats the file as being encoded using a different encoding than the one that was really used, so it doesn't see char U+FEFF.
    • vim doesn't understand the \x{} notation.

    What does this have to do with Perl?

Re: Remove BOM ?
by graff (Chancellor) on Nov 19, 2008 at 06:31 UTC
    One of my co-workers (a vi user) happened to point out to me that he was seeing the BOM as a sequence of 3 bytes, displayed as hex digit strings like this:
    \xef\xbb\xbf
    In any case, the safest, surest, easiest way to remove the BOM (IMHO) is a perl one-liner:
    perl -CD -pe 'tr/\x{feff}//d' file.bom > file.nobom

    UPDATE: Just for grins, I tried vi on my Mac OS X machine and also on a FreeBSD box (the same one my co-worker used). On the Mac, the BOM was not visible, and I couldn't seem to position the cursor or issue a basic "delete" command in any way that could affect the BOM itself. On FreeBSD (where it appeared as three hex byte codes), this sequence of three keystrokes got rid of it: "3d." (YMMV)

      You can also edit the file that contains the BOM using vi: type :set nobomb, then :wq (to save and exit). Done! dror.mikdash@pharmatek.co.il
      A warning about that tr command: it can strip out more than just the BOM. I've seen it strip out some binary characters in an AssemblyInfo.cs file for a copyright symbol. I don't have a fix (yet), so I figured I'd at least warn!
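A byte-level alternative that sidesteps the warning above can be sketched as follows. One plausible explanation for the damage (an assumption, not something confirmed in this thread) is that -CD puts a UTF-8 layer on the files perl opens but not on STDOUT, so a decoded line whose characters all fit in one byte, such as one with a copyright sign, can be written back as Latin-1. The tr also deletes every U+FEFF in the file, not only the leading one. The helper names below are hypothetical; the sketch never decodes the data and removes only a UTF-8 BOM at the very start.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one UTF-8 BOM (bytes EF BB BF) from the
# start of a byte string and leave every other byte untouched.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical helper: copy $in to $out as raw bytes, minus a leading BOM.
sub strip_bom_from_file {
    my ( $in, $out ) = @_;
    open my $ifh, '<:raw', $in  or die "Cannot read $in: $!";
    open my $ofh, '>:raw', $out or die "Cannot write $out: $!";
    local $/;    # slurp mode: read the whole file in one go
    my $data = <$ifh>;
    print {$ofh} strip_utf8_bom($data);
    close $ofh or die "Cannot close $out: $!";
    close $ifh;
    return;
}
```

Calling strip_bom_from_file('file.bom', 'file.nobom') would mirror the one-liner's file names. Because nothing is decoded, multi-byte characters elsewhere in the file pass through unchanged.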
Re: Remove BOM ?
by moritz (Cardinal) on Nov 19, 2008 at 07:39 UTC
    You can do it like you want, if you decode the file:
    open IN, '<:encoding(UTF-8)', $file or die $!; # your code here

    If you don't decode the file, you have to remove the byte sequences that represent the BOM in the encoding that your file has.

    Update: The big difference between vim and perl (in this respect) is that vim tries to auto-detect the character encoding (which is a sane thing to do for a text editor, especially when the text is long and represents human language) and decodes the text with the guessed encoding, while perl doesn't try to guess anything (which is a sane thing to do for a general-purpose programming language).
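The decoding approach described above might look like this as a complete routine. This is a minimal sketch that assumes the file really is UTF-8; the helper name read_lines_without_bom is made up for illustration. Once the input layer has decoded the bytes, the BOM arrives as the single character U+FEFF, so the substitution from the original question matches it.

```perl
use strict;
use warnings;

# Hypothetical helper: read $file as UTF-8 text and return its lines,
# with a leading U+FEFF removed from the first line only.
sub read_lines_without_bom {
    my ($file) = @_;
    open my $in, '<:encoding(UTF-8)', $file
        or die "Cannot open $file: $!";
    my @lines = <$in>;
    close $in;
    $lines[0] =~ s/^\x{FEFF}// if @lines;
    return @lines;
}
```

The returned lines are character strings, so an output handle would need a matching layer (for example '>:encoding(UTF-8)') to write them back as UTF-8.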

Re: Remove BOM ?
by Gangabass (Priest) on Nov 19, 2008 at 07:51 UTC

    :help bomb

Re: Remove BOM ?
by davido (Archbishop) on Oct 01, 2012 at 20:49 UTC

    This little snippet comes from Mojo::JSON:

    # Remove BOM
    $bytes =~ s/^(?:\357\273\277|\377\376\0\0|\0\0\376\377|\376\377|\377\376)//g;

    Dave

      except that /g can't be right. A BOM can only appear as the first few bytes of a data stream. If there is a further BOM then most likely you've got a binary file rather than a text file.

      It's not clear to me what the nulls are doing in there.

      True laziness is hard work

        It depends. Unless all your text processing tools are UNICODE-smart, you can easily end up with a BOM at the beginning of any line, not just the first, and really they could end up anywhere, depending on what you're doing. Imagine using cat and paste on files with BOMs. In my experience (and I've had quite a bit), I almost always end up having to delete BOM-looking strings from the entire file, not just the beginning.

        It's not clear to me what the nulls are doing in there.

        BOM in UTF-32 LE and BE encodings.

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
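To make the alternatives in that regex easier to read, here is a rough sketch that spells out the same five byte sequences in hex and reports which one matched. The table and the helper name split_bom are illustrative assumptions, not part of Mojo::JSON. The UTF-32LE mark is tried before the UTF-16LE mark because the former begins with the latter's two bytes.

```perl
use strict;
use warnings;

# Byte-order marks as raw bytes. Order matters: the UTF-32LE mark starts
# with the same two bytes as the UTF-16LE mark, so it is listed first.
my @BOMS = (
    [ 'UTF-8'    => "\xEF\xBB\xBF" ],
    [ 'UTF-32LE' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-32BE' => "\x00\x00\xFE\xFF" ],
    [ 'UTF-16BE' => "\xFE\xFF" ],
    [ 'UTF-16LE' => "\xFF\xFE" ],
);

# Hypothetical helper: return (encoding name, remaining bytes), or
# (undef, original bytes) when the string does not start with a BOM.
sub split_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ( $name, $mark ) = @$entry;
        if ( substr( $bytes, 0, length $mark ) eq $mark ) {
            return ( $name, substr( $bytes, length $mark ) );
        }
    }
    return ( undef, $bytes );
}
```

For example, my ($enc, $rest) = split_bom($bytes); leaves $enc undefined for input that has no BOM. Only the start of the string is inspected, so BOMs that ended up mid-file (the cat and paste case mentioned above) would need separate handling.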
