
Problems parsing UTF16 file

by stu23 (Initiate)
on Aug 10, 2012 at 04:38 UTC (#986655, perlquestion)
stu23 has asked for the wisdom of the Perl Monks concerning the following question:

I hope I can get some advice on solving this problem. I have a file from a Windows machine. It is encoded as UTF-16LE with a BOM of <FFFE> followed by text data. ALL data is in the format <4200>, and the ends of lines are <0d00> <0A00>. Each line is CSV. I need to read each line of the file, check some specific fields, and write some of the data, with modifications, to a new file. My problem is that I cannot parse on the CR/LF. Below is a test script I have written (I am not an experienced Perl programmer) which shows the different approaches I have tried. All of them can read the file and all print the @array just fine, but none of them recognize the end of line. I have a small test file but I am not sure how to post it.

#!/usr/local/bin/perl
#
#
use strict;
use warnings;
use charnames qw( :full );

my @segment_array;

#use File::BOM(); #this tells script to use the Byte Order Mark in reading the files, but it is not on the system I am using

my $file_segment_name = "TestFile1.svd";
# examining the file in hex, it is utf8 encoded, with a Byte order Marker set at FFFE

#read the files
# open (FH_SEGMENT_FILE, "< $file_segment_name") || ERROR('open', 'segment file');
# open (FH_SEGMENT_FILE, '<:encoding(UTF16-LE)', $file_segment_name) || ERROR('open', 'segment file');
# open (FH_SEGMENT_FILE, '<:raw:perlio:encoding(UTF16-LE):crlf', $file_segment_name) || ERROR('open', 'segment file');
open (FH_SEGMENT_FILE, '< $file_segment_name' )|| ERROR('open', 'segment file');
# binmode (FH_SEGMENT_FILE, '<:crlf: encoding(UTF16-LE) ' );
# open (FH_SEGMENT_FILE, '<:raw:crlf: encoding(UTF16-LE) ', $file_segment_name );
# open (FH_SEGMENT_FILE, '< :crlf :encoding(UTF16)', $file_segment_name);

@segment_array=<FH_SEGMENT_FILE>;
close(FH_SEGMENT_FILE);

#print the file - it prints correctly
print "@segment_array";
print "\n\n"; #put some spaces in

for (my $i = 1; $i <=20 ; $i++){
    my $segment_array= shift(@segment_array);;
    print "$segment_array[$i]";
}
exit;

#subs below this point
#************************
#-------------------------
sub ERROR () {
    print "Sever can't $_[0] the $_[1] \n";
}
#----------------------------

I don't know how to post the file and keep the encoding. So below is some of the file displayed using vi in the hex mode.

0000000: fffe 4000 4100 6900 7200 4d00 6100 6700  ..@.A.i.r.M.a.g.
0000010: 6e00 6500 7400 2000 5300 7500 7200 7600  n.e.t. .S.u.r.v.
0000020: 6500 7900 2000 4400 6100 7400 6100 0d00  e.y. .D.a.t.a...
0000030: 0a00 2300 5400 7900 7000 6500 3a00 2000  ..#.T.y.p.e.:. .
0000040: 7000 6100 7300 7300 6900 7600 6500 0d00  p.a.s.s.i.v.e...
0000050: 0a00 2300 4100 7000 7000 2000 5600 6500  ..#.A.p.p. .V.e.
0000060: 7200 7300 6900 6f00 6e00 3a00 2000 3800  r.s.i.o.n.:. .8.
0000070: 2e00 3200 2000 0900 2000 4200 7500 6900  ..2. ... .B.u.i.
0000080: 6c00 6400 3a00 2000 3200 3500 3400 3600  l.d.:. .2.5.4.6.
0000090: 3000 0d00 0a00 2300 4300 7200 6500 6100  0.....#.C.r.e.a.
00000a0: 7400 6500 6400 2000 6f00 6e00 3a00 2000  t.e.d. .o.n.:. .
00000b0: 3000 3900 3a00 3100 3300 3a00 3400 3700  0.9.:.1.3.:.4.7.
00000c0: 2000 3000 3400 2f00 3100 3000 2f00 3200   .0.4./.1.0./.2.
00000d0: 3000 3100 3200 0d00 0a00 2300 4300 6100  0.1.2.....#.C.a.
00000e0: 7200 6400 2000 4e00 6100 6d00 6500 2a00  r.d. .N.a.m.e.*.
00000f0: 3a00 2000 5500 6200 6900 7100 7500 6900  :. .U.b.i.q.u.i.
0000100: 7400 6900 2000 4e00 6500 7400 7700 6f00  t.i. .N.e.t.w.o.
0000110: 7200 6b00 7300 2000 5300 5200 2d00 3700  r.k.s. .S.R.-.7.
0000120: 3100 2d00 5500 5300 4200 2000 5700 6900  1.-.U.S.B. .W.i.
0000130: 7200 6500 6c00 6500 7300 7300 2000 4100  r.e.l.e.s.s. .A.
0000140: 6400 6100 7000 7400 6500 7200 2000 3000  d.a.p.t.e.r. .0.
0000150: 3000 3a00 3100 3500 3a00 3600 4400 3a00  0.:.1.5.:.6.D.:.
0000160: 3800 3400 3a00 4500 3100 3a00 4600 4100  8.4.:.E.1.:.F.A.
0000170: 0900 2000 4f00 5300 5600 6500 7200 7300  .. .O.S.V.e.r.s.
0000180: 6900 6f00 6e00 3a00 2000 3600 2e00 3100  i.o.n.:. .6...1.

When I run the program, the output of print @array looks like this:

@AirMagnet Survey Data
#Type: passive
#App Version: 8.2 Build: 25460
#Created on: 09:13:47 04/10/2012
#Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA OSVersion: 6.100002 1
#Antenna Angle: 0.000000, Antenna Type:
#dim_X, dim_Y, GPS Map
&,6351.008789,3142.447021, 1
#Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
1334063627,4144.148438,1767.801514,11,'xfinitywifi','C4:0A:CB:68:B9:81',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000

But the output of the second section (the for loop) ALWAYS looks like this. Alternate lines are missed:

#App Version: 8.2 Build: 25460
#Card Name*: Ubiquiti Networks SR-71-USB Wireless Adapter 00:15:6D:84:E1:FA OSVersion: 6.100002 1
#dim_X, dim_Y, GPS Map
#Time,Xpos,Ypos,Channel,SSID,AP,SignalDBM,Signal,NoiseDBM,Noise,MediaType,NodeName,Speed,ByteCount(throughput),PacketCount,PacketLost,LostRate,RetryCount,RetryRate,Longitude,Latitude,Click,APFlags,MCSRx-Tx,IPerfSpeed,Heading, AntennaDirection, iPerf_Throughput_Up, iPerf_Throughput_Down
1334063627,4144.148438,1767.801514,11,'optimumwifi','C4:0A:CB:68:B9:80',-80,20,-94,1,'802.11gn','X1G025_W004','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,6,'Smithtown','0C:D5:02:68:50:3F',-87,12,-94,1,'802.11g','0C:D5:02:68:50:3F','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,1,0,0,0.000000, 0.000000
1334063627,4144.148438,1767.801514,11,'Unknown','98:FC:11:90:FA:D0',-89,9,-94,1,'802.11gn','98:FC:11:90:FA:D0','0','-1','-1','-1','-1','-1','-1',-7311.503300, 4051.325100,*,131,3855,0,0.000000, 0.000000
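
The pattern of skipped lines above appears to match what the for loop in the test script does on its own, regardless of encoding: each pass removes the first element with shift and then indexes into the shortened array with a growing $i. The sketch below reproduces that with made-up data; the sub names and the sample list are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lines read from the file.
my @sample = map { "line $_" } 0 .. 9;

# Same shape as the loop in the test script: shift AND index in one pass.
sub shift_and_index {
    my @copy = @_;
    my @seen;
    for ( my $i = 1 ; $i <= 4 ; $i++ ) {
        shift @copy;              # drops the first remaining element
        push @seen, $copy[$i];    # then indexes into the shortened array
    }
    return @seen;
}

# Plain iteration visits every element exactly once.
sub iterate_all {
    my @copy = @_;
    my @seen;
    for my $line (@copy) {
        push @seen, $line;
    }
    return @seen;
}

print join( ", ", shift_and_index(@sample) ), "\n";
print join( ", ", iterate_all(@sample) ),     "\n";
```

The first print shows every second element starting at index 2, which is the same shape as the output above (#App Version, #Card Name, #dim_X, ...). Iterating with a plain foreach, or using shift alone without the index, avoids the skipping.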

Replies are listed 'Best First'.
Re: Problems parsing UTF16 file
by Khen1950fx (Canon) on Aug 10, 2012 at 06:34 UTC
    This article should clear this up for you.

      Thanks for the link to the thread. I have tried to read the file in the two ways shown in the code below. Neither correctly identifies the end of line (CR/LF). The thread explained the layers, but something must still be wrong. I got a suggestion to use File::BOM, but it is not on the shared server I am using and the IT guy does not want to install it. Would that be a step forward on this problem?

      # open (FH_SEGMENT_FILE, '<:crlf:encoding(UTF-16LE):raw', $file_segment_name );
      open (FH_SEGMENT_FILE, '<:raw:encoding(UTF-16LE):crlf ', $file_segment_name );
      @segment_array=<FH_SEGMENT_FILE>;
      close(FH_SEGMENT_FILE);
Re: Problems parsing UTF16 file
by graff (Chancellor) on Aug 10, 2012 at 16:57 UTC
    Have you tried something like this?
    my $file_segment_name = "TestFile1.svd";
    open( I, "<:encoding(UTF-16LE)", $file_segment_name ) or die "$file_segment_name: $!";
    local $/;
    $_=<I>; # slurp full file content in one read
    s/\x{feff}//; # remove BOM
    @lines = split /\r\n/;
    print "line #$_ : $lines[$_] (EOL)\n" for ( 0 .. $#lines );
    (worked for me)
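
As to why this approach works: the encoding layer turns the bytes into characters first, so by the time split runs, the 0d00 0a00 byte pairs are already the two characters "\r\n". Here is a minimal sketch of the same steps using Encode's decode on an in-memory byte string instead of a file. The sample bytes and the sub name utf16le_lines are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: BOM, then "hi" CRLF "yo" CRLF, as UTF-16LE bytes.
my $bytes = "\xFF\xFE" . "h\0i\0\r\0\n\0" . "y\0o\0\r\0\n\0";

sub utf16le_lines {
    my ($raw) = @_;
    my $text = decode( 'UTF-16LE', $raw );    # bytes -> characters
    $text =~ s/\A\x{FEFF}//;                  # drop the leading BOM character
    return split /\r\n/, $text;               # split on CR LF as characters
}

my @decoded = utf16le_lines($bytes);
print scalar(@decoded), " lines: @decoded\n";
```

Splitting the raw bytes on "\r\n" would not find anything here, because in UTF-16LE the CR and LF are separated by a zero byte.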

      Thanks graff - that worked for me also. Now to figure out why it works!! stu

        I should mention that if you open the input file like this:
        open( $fh, "<:encoding(UTF-16)", $filename );
        (that is, without the "LE" in the encoding spec), then you won't need this line:
        s/\x{feff}//; # remove BOM
        because the "unmarked" version of UTF-16 encoding requires that a stream-initial BOM be provided on input, and the initial BOM is stripped from input as a result.
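
The difference between the two encoding names can be seen directly with Encode's decode. This is only a sketch on a made-up byte string (BOM plus "hi"), not the poster's file:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical sample: BOM + "hi", little-endian.
my $bytes = "\xFF\xFE" . "h\0i\0";

# Explicit byte order: the BOM is decoded as an ordinary character, U+FEFF.
my $with_le = decode( 'UTF-16LE', $bytes );

# Unmarked UTF-16: the BOM selects the byte order and is consumed.
my $with_bom = decode( 'UTF-16', $bytes );

printf "UTF-16LE: %d chars, first is U+%04X\n", length($with_le),  ord($with_le);
printf "UTF-16:   %d chars, first is U+%04X\n", length($with_bom), ord($with_bom);
```

With 'UTF-16LE' the result has three characters and starts with U+FEFF, which is why the s/\x{feff}// step is needed in that case. With 'UTF-16' only the two letters remain.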

        For output of UTF-16, if you're trying to match a particular byte order, it'll be best for the code to state this explicitly, because the "default" output order might be different, depending on your machine and environment.

        Of course, whenever a file is written with 'UTF-16' encoding, the initial BOM is always included, which should make it possible for any other process to read the file correctly - but of course, not all processes that expect UTF-16LE (or BE) will live up to that specification.

        Anyway, when you do decide to be explicit about byte order for an output file, then you should also be sure to include the initial BOM (because it won't be supplied by default). So if you try out the snippet below, see whether there's any difference in the output when you comment out the "UTF-16" open statement and uncomment the two lines that use "UTF-16LE" instead:

        open( I, "<:encoding(UTF-16)", $ARGV[0] ) or die "$ARGV[0]: $!";
        local $/;
        $_=<I>;
        @lines = split /\r\n/;
        # open(O,">:encoding(UTF-16LE):crlf","$ARGV[0].new") or die "$ARGV[0].new:$!";
        # print O "\x{feff}";
        open(O,">:encoding(UTF-16):crlf","$ARGV[0].new") or die "$ARGV[0].new:$!";
        print O "$_\n" for ( @lines );
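
To match the original file format on output (BOM FFFE, little-endian, CR LF line endings), one option is to build the bytes explicitly with Encode's encode. The sketch below does that in memory; the sub name to_utf16le_with_bom is hypothetical. As far as I can tell from the Encode::Unicode documentation, encoding with plain 'UTF-16' prepends a BOM and writes big-endian, so 'UTF-16LE' plus a hand-written BOM looks like the safer choice when the reader expects FFFE.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: turn a list of lines into UTF-16LE bytes with a
# leading BOM and CR LF line endings.
sub to_utf16le_with_bom {
    my @lines = @_;
    my $text = "\x{FEFF}" . join( '', map { "$_\r\n" } @lines );
    return encode( 'UTF-16LE', $text );
}

my $out = to_utf16le_with_bom( 'hi', 'yo' );

# Show the result as hex bytes for comparison with a hex dump of the input.
print join( ' ', map { sprintf '%02x', ord } split //, $out ), "\n";
```

The resulting byte string can then be written through a filehandle opened with '>:raw', so that no further layer touches the line endings.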

        Thanks to the group for all your help. I have two approaches that work and some tutorial about layers. As to my immediate problem, I can press on. But I want to dig deeper into this and understand why the two approaches work. My problem is not actually done - I can read the files OK. But after I have modified the content, I need to write it back in the same format. But I think I am OK for now. Again, thanks to the group. Stu23

Re: Problems parsing UTF16 file
by Anonymous Monk on Aug 10, 2012 at 09:37 UTC

    I don't know how to post the file and keep the encoding. So below is some of the file displayed using vi in the hex mode.

    Like this

    #!/usr/bin/perl --
    use strict;
    use warnings;
    use Data::Dump qw' pp ';
    use Encode::Detective qw' detect ';
    my $file = shift or die "Usage: $0 filename > \n";
    my $data = do {
        open my($fh), '<:raw' , $file or die $!;
        local $/;
        <$fh>;
    };
    my $encoding = detect($data);
    print q{my $data = }, pp($data), ";
    open my(\$fh), '<:$encoding', \\\$data or die;
    ...
    ";
    __END__
    my $data = "\xFE\xFF\0h\0i\0\r\0\n";
    open my($fh), '<:UTF-16BE', \$data or die;
    ...
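
If extra modules such as Encode::Detective or File::BOM cannot be installed, a rough check of the first two bytes may be enough for files like this one. The following is a minimal sketch with core Perl only; the sub name sniff_utf16_bom is hypothetical, and it deliberately handles nothing but the two UTF-16 byte order marks.

```perl
use strict;
use warnings;

# Hypothetical helper: guess a UTF-16 encoding name from the first two bytes.
# Returns undef in scalar context when no UTF-16 BOM is present.
# Caveat: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE and is not
# distinguished here.
sub sniff_utf16_bom {
    my ($raw) = @_;
    return 'UTF-16LE' if $raw =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $raw =~ /\A\xFE\xFF/;
    return;
}

# Sample bytes taken from the __END__ section above (big-endian "hi" CR LF).
my $sample = "\xFE\xFF\0h\0i\0\r\0\n";
my $guess  = sniff_utf16_bom($sample);
print defined $guess ? "$guess\n" : "no UTF-16 BOM found\n";
```

The returned name can be passed to an :encoding(...) layer or to Encode's decode, with the BOM character removed afterwards as in graff's reply.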

      Mr Anonymous Monk, sir. Your code works fine also. Thank you very much. Stu
