PerlMonks  

Re: Don't want BOM in output file

by anneli (Pilgrim)
on Oct 15, 2011 at 09:34 UTC ( [id://931640]=note )


in reply to Don't want BOM in output file

Perl won't write a BOM for you, so it must already be part of the data you are outputting. If it shows up in a different representation in your UTF-8 output, perhaps the BOM was (mistakenly) read from the input as ISO-8859 characters, and Perl's UTF-8 output is translating those characters to their valid UTF-8 sequences.

Replies are listed 'Best First'.
Re^2: Don't want BOM in output file
by beerman (Novice) on Oct 17, 2011 at 15:29 UTC
    Yes, I found the problem. The data did have the UTF-8 BOM. I didn't notice the BOM until I analyzed the bytes (od -x). The issue was that perl was converting the BOM (EFBBBF) to ISO-8859-1 even though I indicated that my output should be UTF-8 (open, ">:utf8", $name). The fix was to also open the input with the utf8 encoding. That is, my original statement was open (INPUT, "< $inputfile"), so I changed that to open (INPUT, "<:utf8", "$inputfile"). Thanks to all for the help with this. Once I realized that the input file really had the BOM, the fix was easy.
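For anyone landing here with the same problem, the fix above might look roughly like this as a complete script. This is only a sketch: `copy_utf8` and `slurp_raw` are made-up helper names, it uses `:encoding(UTF-8)` (a stricter relative of the `:utf8` layer used above), and the optional BOM-stripping step is an addition, not something the fix above did.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy $src to $dst, decoding and encoding as UTF-8.
# If $strip_bom is true, a leading U+FEFF is dropped (an extra step that is
# not part of the fix reported above).
sub copy_utf8 {
    my ($src, $dst, $strip_bom) = @_;
    open(my $in,  '<:encoding(UTF-8)', $src) or die "Cannot read $src: $!";
    open(my $out, '>:encoding(UTF-8)', $dst) or die "Cannot write $dst: $!";
    my $first = 1;
    while (my $line = <$in>) {
        $line =~ s/\A\x{FEFF}// if $first && $strip_bom;
        $first = 0;
        print {$out} $line;
    }
    close $out or die "Cannot close $dst: $!";
    close $in;
}

# Hypothetical helper: read a file back as raw bytes.
sub slurp_raw {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot read $path: $!";
    local $/;
    return scalar <$fh>;
}

# Demo on a throwaway file that starts with a UTF-8 BOM.
my $dir = tempdir(CLEANUP => 1);
open(my $fh, '>:raw', "$dir/in.txt") or die "Cannot create input: $!";
print {$fh} "\xEF\xBB\xBFhello, world\n";
close $fh;

copy_utf8("$dir/in.txt", "$dir/same.txt",  0);
copy_utf8("$dir/in.txt", "$dir/clean.txt", 1);

print "BOM kept\n"     if slurp_raw("$dir/same.txt")  =~ /\A\xEF\xBB\xBF/;
print "BOM stripped\n" if slurp_raw("$dir/clean.txt") eq "hello, world\n";
```

With both handles UTF-8 aware, the BOM passes through as the same three bytes; stripping it is a one-line substitution on the first line read.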

      Great! Thanks for reporting your solution back. :)

      I think the issue was solely due to the input not being UTF-8 aware; it thought the BOM was ISO-8859 (i.e. the three characters "ï»¿"); then when you wrote with UTF-8 awareness, they were translated into the appropriate UTF-8 sequence (C3 AF C2 BB C2 BF), which, when read as UTF-8, translates to the codepoints for "ï»¿" ..!
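The byte arithmetic in that explanation can be checked without any files. A minimal sketch using the core Encode module; `hex_bytes` is a made-up helper name for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes { join ' ', map { sprintf '%02X', ord($_) } split //, $_[0] }

my $bom_bytes = "\xEF\xBB\xBF";   # the UTF-8 BOM as raw bytes

# Read without a decoding layer, each byte becomes one character
# (as in ISO-8859-1). Encoding those three characters as UTF-8
# turns each one into a two-byte sequence.
my $double = encode('UTF-8', $bom_bytes);
print hex_bytes($double), "\n";                           # C3 AF C2 BB C2 BF

# Decoding first yields the single character U+FEFF, which encodes
# back to the original three bytes.
my $char  = decode('UTF-8', $bom_bytes);
my $round = encode('UTF-8', $char);
printf "U+%04X -> %s\n", ord($char), hex_bytes($round);   # U+FEFF -> EF BB BF
```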

      I tested with this:

      local $/;  # slurp mode: read the whole file in one go
      open(my $in,  "<", "myfile")    or die "myfile: $!";
      open(my $out, ">", "myoutfile") or die "myoutfile: $!";
      my $d = <$in>;
      print $out $d;
      close $out;
      close $in;

      "myfile" has the content:

      0000000: efbb bf68 656c 6c6f 2c20 776f 726c 640a  ...hello, world.

      With the code above, Perl does not try to interpret the BOM as a BOM when either reading or writing, and "myoutfile" winds up like this:

      0000000: efbb bf68 656c 6c6f 2c20 776f 726c 64    ...hello, world

      (identical!) If we decide to interpret the input (only) as UTF-8, however, the BOM is interpreted as a UTF-8 sequence, and we get a warning about "Wide character in print" when trying to print it out to a filehandle that doesn't know about UTF-8:

      $ perl test.pl
      Wide character in print at test.pl line 10, <$in> line 1.
      $

      "myoutfile" still has the BOM prepended (is Perl just trying a UTF-8 representation?) in this case. The other notable thing when reading in with "<:utf8" is the value of ord($d): 0xFEFF. If we didn't use utf8, it comes out as 0xEF.

      Using utf8 on both streams causes the BOM to be faithfully read in and written out; using utf8 only on output takes each input byte as the character it would be in ISO-8859 and writes that character's UTF-8 encoding:

      0000000: c3af c2bb c2bf 6865 6c6c 6f2c 2077 6f72  ......hello, wor
      0000010: 6c64                                     ld
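The four combinations tried above can be compared side by side with in-memory filehandles, so no test files are needed. This is a rough sketch rather than the script used above; `copy_with` is a made-up helper name, and the warning handler only counts warnings instead of printing them.

```perl
use strict;
use warnings;

my $input = "\xEF\xBB\xBFhello, world\n";   # raw bytes, BOM first

# Hypothetical helper: copy $input through in-memory filehandles opened
# with the given layers. Returns the bytes written and the number of
# warnings (such as "Wide character in print") raised along the way.
sub copy_with {
    my ($in_layer, $out_layer) = @_;
    my $written = '';
    my $warned  = 0;
    local $SIG{__WARN__} = sub { $warned++ };
    open(my $in,  "<$in_layer",  \$input)   or die "open in: $!";
    open(my $out, ">$out_layer", \$written) or die "open out: $!";
    local $/;                 # slurp mode
    my $data = <$in>;
    print {$out} $data;
    close $out;
    close $in;
    return ($written, $warned);
}

for my $case (['', ''], [':utf8', ''], [':utf8', ':utf8'], ['', ':utf8']) {
    my ($in_layer, $out_layer) = @$case;
    my ($bytes, $warned) = copy_with($in_layer, $out_layer);
    printf "in=%-6s out=%-6s warnings=%d  %s\n",
        $in_layer || 'raw', $out_layer || 'raw', $warned,
        join ' ', map { sprintf '%02x', ord($_) } split //, substr($bytes, 0, 6);
}
```

Only the last case (plain input, utf8 output) should start with c3 af c2 bb c2 bf; the other three start with ef bb bf, and the utf8-in, plain-out case is the one that warns.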

      Fun times!
