
Re: Problems parsing UTF16 file

by graff (Chancellor)
on Aug 10, 2012 at 16:57 UTC

in reply to Problems parsing UTF16 file

Have you tried something like this?
    my $file_segment_name = "TestFile1.svd";
    open( I, "<:encoding(UTF-16LE)", $file_segment_name )
        or die "$file_segment_name: $!";
    local $/;
    $_ = <I>;        # slurp full file content in one read
    s/\x{feff}//;    # remove BOM
    @lines = split /\r\n/;
    print "line #$_ : $lines[$_] (EOL)\n" for ( 0 .. $#lines );
(worked for me)

Re^2: Problems parsing UTF16 file
by stu23 (Initiate) on Aug 10, 2012 at 17:42 UTC

    Thanks graff - that worked for me also. Now to figure out why it works!! stu

      I should mention that if you open the input file like this:
          open( $fh, "<:encoding(UTF-16)", $filename );
      (that is, without the "LE" in the encoding spec), then you won't need this line:
          s/\x{feff}//;    # remove BOM
      because the "unmarked" version of UTF-16 encoding requires that a stream-initial BOM be provided on input, and the initial BOM is stripped from input as a result.
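To make the difference concrete, here is a minimal sketch that decodes the same bytes both ways with the Encode module. The two-character sample text "Hi" and the variable names are made up for illustration; only the handling of the leading BOM matters.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Bytes for the text "Hi" in little-endian UTF-16, preceded by a BOM (FF FE).
my $bytes = "\xFF\xFE" . "H\x00i\x00";

# Explicit byte order: the BOM is decoded as an ordinary character (U+FEFF),
# so it stays in the string and has to be removed by hand.
my $with_le = decode( "UTF-16LE", $bytes );

# Unmarked UTF-16: the BOM selects the byte order and is then discarded.
my $with_bom_detection = decode( "UTF-16", $bytes );

printf "UTF-16LE: %d chars, first is U+%04X\n",
    length($with_le), ord($with_le);
printf "UTF-16:   %d chars, first is U+%04X\n",
    length($with_bom_detection), ord($with_bom_detection);
```

The first line should report three characters starting with U+FEFF, the second two characters starting with U+0048 ("H"). The same thing happens when the encoding is applied as an I/O layer in open().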

      For output of UTF-16, if you're trying to match a particular byte order, it'll be best for the code to state this explicitly, because the "default" output order may not be the one you want: Perl's Encode module writes big-endian when the encoding is given as plain "UTF-16", and other tools may choose differently.

      Whenever a file is written with 'UTF-16' encoding, the initial BOM is always included, which should make it possible for any other process to read the file correctly. In practice, though, not all processes that expect UTF-16LE (or BE) will live up to that specification.

      Anyway, when you do decide to be explicit about byte order for an output file, then you should also be sure to include the initial BOM (because it won't be supplied by default). So if you try out the snippet below, see whether there's any difference in the output when you comment out the "UTF-16" open statement and uncomment the two lines that use "UTF-16LE" instead:

          open( I, "<:encoding(UTF-16)", $ARGV[0] ) or die "$ARGV[0]: $!";
          local $/;
          $_ = <I>;
          @lines = split /\r\n/;
          # open(O,">:encoding(UTF-16LE):crlf","$ARGV[0].new") or die "$ARGV[0].new:$!";
          # print O "\x{feff}";
          open(O,">:encoding(UTF-16):crlf","$ARGV[0].new") or die "$ARGV[0].new: $!";
          print O "$_\n" for ( @lines );
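If you'd rather look at the bytes directly than compare output files, something like the following sketch may help. It encodes a made-up two-character string ("Hi", not from your data) three ways and prints the result as hex; it assumes the behavior documented for Encode::Unicode, where plain "UTF-16" on output means a BOM followed by big-endian data.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $text = "Hi";

# Plain "UTF-16": Encode adds the BOM and writes big-endian.
my $unmarked = unpack "H*", encode( "UTF-16", $text );

# Explicit "UTF-16LE": little-endian, but no BOM unless we supply one.
my $le_no_bom = unpack "H*", encode( "UTF-16LE", $text );

# Explicit "UTF-16LE" with U+FEFF written first, like the commented-out
# 'print O "\x{feff}"' line in the snippet above.
my $le_with_bom = unpack "H*", encode( "UTF-16LE", "\x{FEFF}$text" );

print "UTF-16         : $unmarked\n";     # feff00480069
print "UTF-16LE       : $le_no_bom\n";    # 48006900
print "UTF-16LE + BOM : $le_with_bom\n";  # fffe48006900
```

Only the last form matches a little-endian input file that starts with a BOM, which is why the explicit "UTF-16LE" open goes together with the extra print of U+FEFF.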

      Thanks to the group for all your help. I have two approaches that work and some tutorial about layers. As to my immediate problem, I can press on. But I want to dig deeper into this and understand why the two approaches work. My problem is not actually done - I can read the files OK. But after I have modified the content, I need to write it back in the same format. But I think I am OK for now. Again, thanks to the group. Stu23
