
Re^3: Problems parsing UTF16 file

by graff (Chancellor)
on Aug 11, 2012 at 18:19 UTC (#986915)

in reply to Re^2: Problems parsing UTF16 file
in thread Problems parsing UTF16 file

I should mention that if you open the input file like this:
open( $fh, "<:encoding(UTF-16)", $filename );
(that is, without the "LE" in the encoding spec), then you won't need the line that strips the initial BOM by hand, because the "unmarked" version of UTF-16 encoding requires that a stream-initial BOM be provided on input, and that BOM is stripped from the input as a result.
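You can see the difference without touching a file. This is only a sketch using the core Encode module; the sample bytes are made up for illustration (a little-endian BOM followed by "hi"):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample data: BOM (FF FE) followed by "hi" in UTF-16LE.
my $bytes = "\xFF\xFE" . "h\0i\0";

# "UTF-16" reads the BOM to pick the byte order, then drops it.
my $unmarked = decode( 'UTF-16', $bytes );

# "UTF-16LE" treats the BOM as an ordinary character (U+FEFF) and keeps it.
my $explicit = decode( 'UTF-16LE', $bytes );

printf "UTF-16:   %d chars\n", length($unmarked);
printf "UTF-16LE: %d chars, first is U+%04X\n", length($explicit), ord($explicit);
```

The extra character in the UTF-16LE result is the BOM that you would otherwise have to remove yourself.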

For output of UTF-16, if you're trying to match a particular byte order, it'll be best for the code to state this explicitly, because the "default" output order may not be the one you need (Perl's Encode documentation says that an unmarked UTF-16 encode produces big-endian output with a BOM).

Of course, whenever a file is written with 'UTF-16' encoding, the initial BOM is always included, which should make it possible for any other process to read the file correctly. In practice, though, not every process that expects UTF-16LE (or BE) will handle that BOM the way the specification says it should.
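If you want to see which bytes each encoding spec actually produces on your machine, something like the following should do. It is a rough sketch using the core Encode module; the `hex_bytes` helper and the sample text are my own, just for illustration:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Show a byte string as space-separated hex pairs.
sub hex_bytes { join ' ', map { sprintf '%02X', ord } split //, shift }

my $text = "hi";    # sample text, chosen only for illustration

# Unmarked "UTF-16": a BOM is written first.
print "UTF-16:         ", hex_bytes( encode( 'UTF-16', $text ) ), "\n";

# Explicit byte order: no BOM unless you add U+FEFF yourself.
print "UTF-16LE:       ", hex_bytes( encode( 'UTF-16LE', $text ) ), "\n";
print "UTF-16LE + BOM: ", hex_bytes( encode( 'UTF-16LE', "\x{FEFF}$text" ) ), "\n";
```

The first two hex pairs of the "UTF-16" line are the BOM, and they tell you which byte order was picked for you.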

Anyway, when you do decide to be explicit about byte order for an output file, then you should also be sure to include the initial BOM (because it won't be supplied by default). So if you try out the snippet below, see whether there's any difference in the output when you comment out the "UTF-16" open statement and uncomment the two lines that use "UTF-16LE" instead:

open( I, "<:encoding(UTF-16)", $ARGV[0] ) or die "$ARGV[0]: $!";
local $/;
$_=<I>;
@lines = split /\r\n/;
# open(O,">:encoding(UTF-16LE):crlf","$ARGV[0].new") or die "$ARGV[0].new:$!";
# print O "\x{feff}";
open(O,">:encoding(UTF-16):crlf","$ARGV[0].new") or die "$ARGV[0].new: $!";
print O "$_\n" for ( @lines );
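To compare the two variants, it may help to look at the first few raw bytes of each output file. Here is a minimal sketch; the `leading_bytes` helper and the script name in the usage comment are made up for this example:

```perl
use strict;
use warnings;

# Return the first $n bytes of a file as hex pairs, read without any decoding.
sub leading_bytes {
    my ( $path, $n ) = @_;
    open( my $fh, '<:raw', $path ) or die "$path: $!";
    my $got = read( $fh, my $buf, $n );
    die "$path: $!" unless defined $got;
    close $fh;
    return join ' ', map { sprintf '%02X', ord } split //, $buf;
}

# Usage: perl bomcheck.pl somefile.new   (the script name is hypothetical)
print "$_: ", leading_bytes( $_, 4 ), "\n" for @ARGV;
```

A file that starts with FF FE is little-endian with a BOM, one that starts with FE FF is big-endian with a BOM, and anything else means no BOM was written.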
