Great! Thanks for reporting your solution back. :)
I think the issue was solely due to the input not being UTF-8 aware: Perl read the BOM bytes as ISO-8859-1 (i.e. the three characters "ï»¿"); then, when you wrote with UTF-8 awareness, they were translated into the corresponding UTF-8 sequence (C3 AF C2 BB C2 BF), which, when read back as UTF-8, decodes to the codepoints for "ï»¿"!
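To make the double-encoding concrete, here is a minimal sketch using the core Encode module. The helper names hex_bytes and latin1_to_utf8 are made up for illustration; the sketch only mimics what the I/O layers did, it is not what your script actually ran.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return the bytes of a string as space-separated uppercase hex.
# (hypothetical helper, for illustration only)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# Decode raw bytes as ISO-8859-1, then encode the resulting characters
# as UTF-8. This is the "read unaware, write aware" round trip.
# (hypothetical helper, for illustration only)
sub latin1_to_utf8 {
    my ($bytes) = @_;
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

my $bom     = "\xEF\xBB\xBF";          # the UTF-8 BOM as raw bytes
my $mangled = latin1_to_utf8($bom);

print hex_bytes($bom),     "\n";       # EF BB BF
print hex_bytes($mangled), "\n";       # C3 AF C2 BB C2 BF
```

Each of the three BOM bytes is below 0x100 but above 0x7F, so each one becomes a two-byte UTF-8 sequence, which is why three bytes turn into six.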
I tested with this:
undef $/;  # slurp mode: <$in> returns the whole file in one read
open(my $in, "<", "myfile") or die "Cannot open myfile: $!";
open(my $out, ">", "myoutfile") or die "Cannot open myoutfile: $!";
my $d = <$in>;
print $out $d;
close $out;
close $in;
"myfile" has the content:
0000000: efbb bf68 656c 6c6f 2c20 776f 726c 640a ...hello, world.
With the code above, Perl does not try to interpret the BOM as a BOM when reading or when writing, and "myoutfile" winds up like this:
0000000: efbb bf68 656c 6c6f 2c20 776f 726c 640a ...hello, world.
(identical!) If we decide to interpret the input (only) as UTF-8, however, the BOM is interpreted as a UTF-8 sequence, and we get a warning about "Wide character in print" when trying to print it out to a filehandle that doesn't know about UTF-8:
$ perl test.pl
Wide character in print at test.pl line 10, <$in> line 1.
$
"myoutfile" still has the BOM prepended in this case (is Perl just falling back to a UTF-8 representation?). The other notable thing when reading in with "<:utf8" is the value of ord($d): 0xFEFF. Without the utf8 layer, it comes out as 0xEF.
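If you want to check the ord() difference without touching the disk, something like the following should do. It is only a sketch: first_ord is a made-up helper, and it reads from an in-memory filehandle holding the same bytes as "myfile" rather than from the file itself.

```perl
use strict;
use warnings;

# Open an in-memory copy of the bytes with the given mode ("<" or
# "<:utf8"), read the first line, and return the ordinal of its first
# character. (hypothetical helper, for illustration only)
sub first_ord {
    my ($mode, $bytes) = @_;
    open(my $in, $mode, \$bytes) or die "Cannot open in-memory handle: $!";
    my $d = <$in>;
    close $in;
    return ord($d);
}

my $content = "\xEF\xBB\xBFhello, world\n";   # same bytes as "myfile"

printf "raw:  0x%X\n", first_ord("<",      $content);   # 0xEF
printf "utf8: 0x%X\n", first_ord("<:utf8", $content);   # 0xFEFF
```

Without the layer, the first "character" is just the first byte of the BOM; with it, the three bytes are decoded into the single codepoint U+FEFF.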
Using utf8 on both streams causes the BOM to be faithfully read in and written out, while using utf8 only on output writes each input byte as if it were an ISO-8859-1 character, re-encoded in UTF-8:
0000000: c3af c2bb c2bf 6865 6c6c 6f2c 2077 6f72 ......hello, wor
0000010: 6c64 0a ld.
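For anyone who wants to reproduce all four combinations in one go, here is a rough sketch. It assumes a Unix-like system (no CRLF translation on plain "<" / ">"), works in a temporary directory instead of the current one, and uses a made-up helper, copy_with_layers. It also silences the "Wide character" warning, which you would normally want to see.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Copy the first line of $src to $dst using the given open modes, then
# return the bytes that ended up in $dst as a lowercase hex string.
# (hypothetical helper, for illustration only)
sub copy_with_layers {
    my ($in_mode, $out_mode, $src, $dst) = @_;
    open(my $in,  $in_mode,  $src) or die "Cannot read $src: $!";
    open(my $out, $out_mode, $dst) or die "Cannot write $dst: $!";
    my $d = <$in>;
    {
        no warnings 'utf8';   # hide "Wide character in print" for this demo
        print {$out} $d;
    }
    close $out or die "Cannot close $dst: $!";
    close $in;

    # Re-read the result as raw bytes so no layer can alter what we see.
    open(my $check, '<:raw', $dst) or die "Cannot reopen $dst: $!";
    local $/;                 # slurp the whole output file
    my $bytes = <$check>;
    close $check;
    return unpack('H*', $bytes);
}

my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/myfile";
open(my $fh, '>:raw', $src) or die "Cannot create $src: $!";
print {$fh} "\xEF\xBB\xBFhello, world\n";
close $fh or die "Cannot close $src: $!";

for my $combo (['<', '>'], ['<:utf8', '>'], ['<:utf8', '>:utf8'], ['<', '>:utf8']) {
    my ($in_mode, $out_mode) = @$combo;
    printf "%-7s %-7s %s\n", $in_mode, $out_mode,
        copy_with_layers($in_mode, $out_mode, $src, "$dir/myoutfile");
}
```

The first three combinations should leave the output starting with efbbbf, and only the last one (plain read, utf8 write) should start with c3afc2bbc2bf, matching the dumps above.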
Fun times!