
Reading binary file in perl having records of different length

by jaypal (Beadle)
on Jun 17, 2014 at 00:14 UTC (#1090087, perlquestion)
jaypal has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I am working on an automation project where I have to read a binary file, parse it, and print it out in a readable format. The binary file can be up to 4-5 MB in size and can contain around 10,000 records. Records are separated by a 2-byte eye catcher, which is == (3d3d in hex). I have a template that tells me the length of each field in the record. I call the records variable-length because the last field can vary in length from record to record. The good thing is that right after the 2-byte eye catcher comes the 2-byte length of the record.

So my approach as of now is to read the first two bytes of the binary file and check whether they are my eye catcher. If they are, I read the next two bytes, which give me the length of the record. I convert that length to decimal and use it in the read function to read the entire record, then pass it to a parsing subroutine that will split and convert each field accordingly. The length includes the eye catcher, which is why I am reading $length - 4 (the 2-byte eye catcher and the 2-byte length have already been read into the buffer).

#!/usr/local/bin/perl
use strict;
use warnings;

open my $fh, '<', 'binary.file' or die "File not found: $!";
binmode($fh);

my ($xdr, $buffer, $length) = "";

# read until end of file ...
while (read ($fh, $buffer, 2) != 0) {

    # if file does not start with eye catcher skip until you find one
    next unless ((unpack 'H*', $buffer) eq "3d3d");

    # append the eyecatcher to xdr variable
    $xdr .= $buffer;

    # read next two bytes which is length of the record
    read ($fh, $buffer, 2);

    # convert the binary length to decimal for use in read function
    my $length = unpack('s',pack 's', hex(unpack 'H*', $buffer));

    # append the length to xdr variable
    $xdr .= $buffer;

    # read the binary stream till the length of record - 4 bytes
    read ($fh, $buffer, $length-4);

    # append the entire xdr
    $xdr .= $buffer;

    #send to parsing subroutine for parsing
    $xdr = "";
}

My questions are:
1. Is this a good approach? How can I improve this?
2. Would it be better to read the entire binary file into an array, splitting at the eye catcher? Would reading the entire file into an array be a performance hit?
3. There can be bad records in the file where the length is wrong, so I need a check that only sends a record for parsing if, after the full length of the record has been read, the next two bytes are 3d3d.

If anything I have written is ambiguous, please let me know in the comments and I will update this question to make it clearer. I don't have any questions about parsing yet; it is just the reading I am most concerned about.

Looking forward to your wisdom.

Replies are listed 'Best First'.
Re: Reading binary file in perl having records of different length
by Anonymous Monk on Jun 17, 2014 at 00:58 UTC

    1. Yes. Based on your description, your approach looks good, and it's the approach I would have chosen (of a few possible ones). A few minor improvement suggestions below.

    2. Reading the entire file and splitting it might make the code a little "easier", if you can safely say that the "==" sequence never appears anywhere else in the file. Otherwise it will make things more complicated! This method can also be expected to take more memory and will probably be slower. I'd stick with your current approach.

    3. You could achieve this by adding some state to your parsing. To do it "right" would require some rewriting of the code. TIMTOWTDI, I'll suggest one possible approach in pseudocode:

    my $expect = 'eyecatcher';
    my $record;
    while (1) {
        if ($expect eq 'eyecatcher' || $expect eq 'eyecatcher_after_record') {
            if (read_two_bytes() eq '==') {
                process_record($record) if $expect eq 'eyecatcher_after_record';
                $record = undef;
                $expect = 'length';
            }
            else { die "expected eyecatcher" }
        }
        elsif ($expect eq 'length') {
            my $length = read_two_bytes();
            $record = read_bytes($length);
            $expect = 'eyecater_after_record';
        }
    }

    I hope this makes sense. You can break out of the while based on when you hit the end of the file, and you may need to then process any unprocessed final $record.
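A runnable sketch of the pseudocode above, under two assumptions taken from the original question: the 2-byte length is big-endian, and it counts the eye catcher and the length field themselves. The names read_exact and read_records are made up for illustration, and returning the record bodies as a list (rather than calling a parsing subroutine) is just a convenience for the sketch.

```perl
use strict;
use warnings;
use constant {
    FIRST_EYECATCHER => 0,
    LENGTH           => 1,
    NEXT_EYECATCHER  => 2,
};

# Read exactly $n bytes; return undef at EOF or on a short read.
sub read_exact {
    my ($fh, $n) = @_;
    my $got = read($fh, my $buf, $n);
    die "read failed: $!" unless defined $got;
    return $got == $n ? $buf : undef;
}

# Return the record bodies (the bytes after each 4-byte header).
sub read_records {
    my ($fh) = @_;
    my ($state, $record, @records) = (FIRST_EYECATCHER);
    while (1) {
        if ($state == FIRST_EYECATCHER || $state == NEXT_EYECATCHER) {
            my $mark = read_exact($fh, 2);
            if (!defined $mark) {               # end of file
                push @records, $record if $state == NEXT_EYECATCHER;
                last;
            }
            die "expected eye catcher" unless $mark eq '==';
            push @records, $record if $state == NEXT_EYECATCHER;
            $state = LENGTH;
        }
        elsif ($state == LENGTH) {
            my $raw = read_exact($fh, 2) // die "truncated length";
            my $len = unpack 'n', $raw;         # assumed big-endian
            die "bad length $len" if $len < 4;
            $record = read_exact($fh, $len - 4) // die "truncated record";
            $state  = NEXT_EYECATCHER;
        }
    }
    return @records;
}
```

To try it without a real file, open an in-memory filehandle: `open my $fh, '<', \$bytes` followed by `my @bodies = read_records($fh);`.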

    A few improvement suggestions for your current code. The most important one: you don't check the return value of read to make sure that you actually got back the number of bytes you requested. You should do that to handle errors in reading the file (such as premature EOF). A small one: you currently declare $length twice, so you can remove the declaration before the loop. Although it doesn't really hurt, I don't think you need the initial unpack; a simple $buffer eq '==' should be enough. Likewise on the second read, a simple unpack('s',$buffer) should be enough. Another minor nit: you could declare my $xdr inside the loop, so you don't need to treat it like a global and clear it at the end of every iteration.

    Otherwise, good!
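For illustration, here is roughly what the original loop might look like with those suggestions applied. This is a sketch, not a drop-in replacement: each_record and the $handler callback are invented names, and the 'n' template assumes the length is stored big-endian ('s' uses the machine's native byte order, which may or may not match the file).

```perl
use strict;
use warnings;

# Call $handler->($record) for each record read from $fh.
# Each record is passed whole, header included. Returns the record count.
sub each_record {
    my ($fh, $handler) = @_;
    my $count = 0;
    while (1) {
        my $got = read($fh, my $buffer, 2);
        die "read error: $!" unless defined $got;
        last if $got < 2;                     # end of file
        next unless $buffer eq '==';          # skip two bytes at a time until an eye catcher

        my $xdr = $buffer;                    # declared per record, nothing to reset
        (read($fh, $buffer, 2) // 0) == 2
            or die "truncated length field";
        my $length = unpack 'n', $buffer;     # assumed big-endian
        $length >= 4 or die "implausible length $length";
        $xdr .= $buffer;

        (read($fh, $buffer, $length - 4) // 0) == $length - 4
            or die "truncated record";
        $xdr .= $buffer;

        $handler->($xdr);
        $count++;
    }
    return $count;
}
```

A call could look like `each_record($fh, sub { parse($_[0]) })`, where parse stands in for whatever parsing subroutine you end up writing.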

      Ack! Learn a lesson from my own mistake and use constants instead of strings for $expect (note the typo "eyecater"). I was being lazy :-(

      Also, the pseudocode doesn't handle the case of the file not beginning with "==", which you could handle in the first else like so: else { die "expected eyecatcher" unless $expect eq 'eyecatcher' }.

      If the logic starts getting too complex, get a little more verbose and break the first if up: if ($expect eq 'eyecatcher') {} elsif ($expect eq 'eyecatcher_after_record') {} and so on. Always cover all branches; at the very least throw an else { die "unexpected" } on there during development.

      And choosing the right names for your states helps a lot. For example, "eyecatcher" might be better named "first_eyecatcher".

        Thank you so much, your improvements to my existing code were great and I am currently trying to modify the code as per your suggestions.

        One thing I have to handle: if a record has a bad length, the read will run into the next, possibly good, record and send that to the parsing subroutine. It will also prevent me from processing the good record, because its eye catcher will probably have been consumed by the previous read.
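One possible way to deal with this, sketched under the same assumptions as the question (big-endian length that includes the 4-byte header): accept a record only if its declared length lands exactly on the next eye catcher or on end of file, and otherwise seek back and rescan one byte further on, so a bad length never swallows the following record. The name valid_records is hypothetical, and a record body that happens to contain a plausible fake header can still fool the check.

```perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Return the records in $fh whose declared length ends at the next
# eye catcher or at end of file; rescan byte by byte after anything else.
sub valid_records {
    my ($fh) = @_;
    my @records;
    my $pos = 0;
    while (1) {
        seek($fh, $pos, SEEK_SET) or die "seek failed: $!";
        my $got = read($fh, my $head, 4);
        die "read failed: $!" unless defined $got;
        last if $got < 4;                        # not enough left for a header
        my ($mark, $length) = unpack 'a2 n', $head;
        if ($mark ne '==' || $length < 4) {
            $pos++;                              # not a header: slide one byte
            next;
        }
        my $body_len = $length - 4;
        my $body = '';
        my $n = read($fh, $body, $body_len) // 0;
        my $m = read($fh, my $next, 2) // 0;     # peek at what follows
        if ($n == $body_len && ($m == 0 || $next eq '==')) {
            push @records, $head . $body;
            $pos += $length;                     # next record starts here
        }
        else {
            $pos++;                              # bad length: resynchronise
        }
    }
    return @records;
}
```

Because every iteration starts with a seek to $pos, the bytes read while peeking are never lost. Seeking per byte is slow while resynchronising, which should be tolerable for files of a few megabytes.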

        Will explore more ways and get back to you. Thanks again for great comments.

Re: Reading binary file in perl having records of different length
by andal (Hermit) on Jun 17, 2014 at 07:10 UTC

    In general, your approach is fine. Perl's "read" function uses buffered streams, so it is fast even when reading 2 bytes at a time. If your files are only a few megabytes, you could read them completely into a string, but I don't think it would improve speed: in my tests, the stream still obtained data in the same chunk sizes, so the number of system calls didn't change.

    One comment on your use of "pack/unpack": you overuse it somewhat. For example, when searching for the eye catcher just do "next unless $buffer eq '==';". When converting the binary length just do "my $length = unpack('s', $buffer);".

    One more thing. You don't check the return value of "read", especially when you read "$length - 4" bytes. The file might be corrupted, and you would never get the desired number of bytes. Also, you say your records may contain an incorrect length. What would be your strategy for recovery in that situation? The length could point into the middle of the next record.

      Thanks andal. All great suggestions. The binary length suggestion didn't work in my case: I need to read the most significant byte first.

      For example, the binary length in hex is 0x03 0x50, so I need to read the next 848 bytes. However, unpack('s', $buffer) returns 20483 (it reads the binary data as 0x50 0x03).

        See pack patterns "n" and "v" instead, or the "<" and ">" modifiers if you really want signed interpretation.

        $ perl -le 'print unpack("v","\x03\x50")'
        20483
        $ perl -le 'print unpack("n","\x03\x50")'
        848
        $ perl -le 'print unpack("s<","\x03\x50")'
        20483
        $ perl -le 'print unpack("s>","\x03\x50")'
        848
Re: Reading binary file in perl having records of different length
by Anonymous Monk on Jun 17, 2014 at 20:12 UTC

    I see a brother has pointed out the issue with == alignment. Another point to make is that reading the file as a whole can indeed simplify parsing. Consider:

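    $data = do { local $/; <$fh> };    # slurp; local restores $/ afterwards
    while ($data =~ m/==/) {
        # lowercase "a" keeps trailing spaces and NULs, which "A" would strip
        ($rec, $data) = unpack("n/a a*", $');
        process($rec);
    }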
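A variant of the slurp idea that walks the buffer by offset instead of searching for "==", so an "==" inside a record body is never mistaken for a boundary. It assumes, as in the original question, a big-endian length that counts the 4-byte header; split_records is an illustrative name, and the sub simply stops at the first position that does not look like a valid header rather than trying to recover.

```perl
use strict;
use warnings;

# Split a slurped buffer into record bodies by trusting each length field.
# Stops at the first offset that does not hold a plausible record header.
sub split_records {
    my ($data) = @_;
    my @records;
    my $offset = 0;
    my $total  = length($data);
    while ($offset + 4 <= $total) {
        my ($mark, $length) = unpack 'a2 n', substr($data, $offset, 4);
        last if $mark ne '=='
             || $length < 4
             || $offset + $length > $total;
        push @records, substr($data, $offset + 4, $length - 4);
        $offset += $length;
    }
    return @records;
}
```

Typical use would be `my $data = do { local $/; <$fh> };` on a filehandle in binmode, then `process($_) for split_records($data);`.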

      Thanks for the snippet. Yes that is an option I am exploring as well. Since I don't have a production grade binary yet (application is still being developed) I only have a binary file with 5 records (so can't test the performance benefit).
      What I am doing is creating a separate parsing subroutine that expects one record at a time, so how I read is independent of parsing. I am adding a run-time option so the user can choose whether to read the binary in slurp mode or byte mode. This is what I have for slurp mode:

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Fcntl qw(:seek);
      use Data::Dumper;

      open my $fh, '<', 'Test.NEW' or die "File not found: $!";
      binmode($fh);

      my $data = do { local $/; <$fh> };
      # split the raw bytes (not their hex dump) in front of each eye catcher
      my @data = split /(?=\x3d\x3d)/, $data;

      for my $xdr (@data) {
          open (my $fh1, '<', \$xdr);          # create a filehandle from scalar
          read ($fh1, my $buffer, 2);          # read the eye catcher
          read ($fh1, $buffer, 2);             # read the length
          my $length = unpack 'n', $buffer;    # identify the length in decimal
          seek $fh1, 0, SEEK_SET;              # reset the offset
          read ($fh1, $buffer, $length);       # read till the length to prevent garbage bytes
          process($buffer);
      }
      #print Dumper \@data;
        I don't have a production grade binary yet

        Since you are dealing with binary data, I don't think your "eyecatcher" is a good idea, as "\x3d\x3d" ("==") could legitimately be part of your data. I think it better to rely on a record starting with a byte count immediately followed by a fixed-length header string that can easily be identified and validated, perhaps by regular expression, e.g. /^Record\s\d{5}$/ for "Record 00001", "Record 02784" etc. Such a string is far less likely to appear in the binary data, which should make unravelling bad records far easier.
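To make the suggestion concrete, here is a minimal sketch of validating such a record. Everything about the layout is hypothetical: a 2-byte big-endian byte count, then a 12-byte header such as "Record 00001", then the payload, with the count assumed to cover the header and payload but not itself. parse_record is an invented name.

```perl
use strict;
use warnings;

# Validate one record in the hypothetical layout described above.
# Returns (header, payload) on success and an empty list otherwise.
sub parse_record {
    my ($bytes) = @_;
    return unless length($bytes) >= 14;              # count + header
    my ($count, $header) = unpack 'n a12', $bytes;
    return unless $header =~ /\ARecord\s\d{5}\z/;    # anchored header check
    return unless $count >= 12 && length($bytes) >= 2 + $count;
    return ($header, substr($bytes, 14, $count - 12));
}
```

The anchors \A and \z are used instead of ^ and $ so that a trailing newline byte in binary data cannot slip past the match.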

        I don't know if you have any control over the format of the binary files but I feel that the "==" between records is just storing up trouble and should be reconsidered. It is too short to be unlikely to appear in the data and, by preceding the record, adds complications to record alignment.


