Optimizing binary file parser (pack/unpack)

by pwagyi (Acolyte)
on Oct 03, 2017 at 07:20 UTC
pwagyi has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I'm processing a binary file which could contain millions of records. Each record is a fixed-length header followed by the body (header, body+).

Currently unpack is used to unpack the binary data. From the NYTProf profile, a lot of time is spent in the read and unpack calls (which is expected). I'm not sure whether there is an optimized version of pack/unpack (like a compiled regex) where the pack/unpack template is compiled once, which could potentially be speedier.
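To illustrate the kind of comparison I mean, here is a minimal sketch using the core Benchmark module; the 12-byte record layout is invented for illustration, not my real format:

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # A fake buffer of 1000 12-byte records (u32, u32, float, all
    # little-endian). Invented layout, purely for comparison.
    my $n   = 1000;
    my $buf = pack '(V V f<)*', map { ( $_, $_ * 2, $_ / 3 ) } 1 .. $n;

    cmpthese( -2, {
        one_big_unpack => sub {
            my @fields = unpack "(V V f<)$n", $buf;
        },
        per_record => sub {
            my @fields;
            push @fields, unpack 'x' . $_ * 12 . ' V V f<', $buf
                for 0 .. $n - 1;
        },
    } );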


Replies are listed 'Best First'.
Re: Optimizing binary file parser (pack/unpack)
by ikegami (Pope) on Oct 03, 2017 at 08:00 UTC

    If most of your time is spent in the pack/unpack, that's a good thing! You're probably using them very effectively.

Re: Optimizing binary file parser (pack/unpack)
by Corion (Pope) on Oct 03, 2017 at 08:12 UTC

    Depending on whether you actually need to unpack each record, you could hardcode the offsets and reject rows before actually unpacking them. index can quickly look at a position in a string without needing pack or unpack.
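    For example (a sketch only; the 8-byte header layout and the wanted record type are made up):

        use strict;
        use warnings;
        use Fcntl ':seek';

        # Hypothetical layout, purely for illustration: 8-byte record
        # header, body length as little-endian u32 at offset 0, record
        # type byte at offset 4.
        my $HEAD_LEN = 8;
        my $WANTED   = 0x07;

        open my $fh, '<:raw', 'data.bin' or die "data.bin: $!";
        while ( read( $fh, my $head, $HEAD_LEN ) == $HEAD_LEN ) {
            my $len  = unpack 'V', $head;       # just the length field
            my $type = ord substr $head, 4, 1;  # peek at one byte, no unpack
            if ( $type != $WANTED ) {
                seek $fh, $len, SEEK_CUR;       # skip the body unparsed
                next;
            }
            read $fh, my $body, $len;
            # ... full unpack of $body only for records we care about ...
        }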

    Depending on how you unpack things, it might be quicker to build one large unpack template instead of unpacking items in a loop.
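    For example, with fixed 8-byte items (invented layout, sketch only):

        # Instead of unpacking each item separately in a loop ...
        my $n   = 4;
        my $buf = pack 'V*', 1 .. 2 * $n;    # $n items of two u32s each
        my @slow;
        push @slow, unpack( 'x' . $_ * 8 . ' V V', $buf ) for 0 .. $n - 1;

        # ... one grouped template gets the same list in a single call:
        my @fast = unpack "(V V)$n", $buf;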

      I have to unpack all fields of the records. Yes, I'm already grouping into one large template as much as possible, but there are optional fields in the records, so those have to be handled as well.
Re: Optimizing binary file parser (pack/unpack)
by salva (Abbot) on Oct 03, 2017 at 08:51 UTC
    Consider also writing an XS module for unpacking the data. It should be relatively easy if you know how to program in C.
Re: Optimizing binary file parser (pack/unpack)
by BrowserUk (Pope) on Oct 03, 2017 at 13:26 UTC

    Show some code, there might be some things that can be optimised.

    If you have, or can install, Inline::C, it can make processing binary records much quicker, especially if you move the optional-fields logic into C.
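    For instance (a sketch only; the field layout is invented to match the kind of fixed prefix described above, and unpack_fixed is a made-up name):

        use strict;
        use warnings;
        use Inline C => q{
            /* Pull a little-endian u16 and two u32s off the front of a
               record body and return them as a Perl list. */
            void unpack_fixed(SV *body) {
                STRLEN len;
                unsigned char *p = (unsigned char *)SvPVbyte(body, len);
                Inline_Stack_Vars;
                Inline_Stack_Reset;
                if (len >= 10) {
                    Inline_Stack_Push(sv_2mortal(newSVuv(
                        p[0] | (p[1] << 8))));
                    Inline_Stack_Push(sv_2mortal(newSVuv(
                        p[2] | (p[3] << 8) | (p[4] << 16) | ((UV)p[5] << 24))));
                    Inline_Stack_Push(sv_2mortal(newSVuv(
                        p[6] | (p[7] << 8) | (p[8] << 16) | ((UV)p[9] << 24))));
                }
                Inline_Stack_Done;
            }
        };

        my @fields = unpack_fixed( pack 'v V V', 1, 2, 3 );
        print "@fields\n";    # 1 2 3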


      Here is pseudocode.

      The record fields can be either fixed or variable length. Fixed data types: Character, integral types (unsigned short, unsigned long), Float. Variable data types: String (Pascal-style, C/a), array of unsigned short, etc. Optional fields at the end of a record can be omitted, so a record with optional (Byte, C/a(String), Float) fields needs to be handled somehow.

      read_file_header;
      # determine endianness from the file-header data, then set up the
      # unpack letters for it ("v"/"V" are little-endian, "n"/"N" big-endian)
      if ($endian eq 'little') {
          $uSHORT = 'v';
          $uLONG  = 'V';
          $Float  = 'f<';       # little-endian float (perl 5.10+)
          $REC_HEAD = ...;
      }
      else {
          $uSHORT = 'n';
          $uLONG  = 'N';
          $Float  = 'f>';
          $REC_HEAD = ...;
          # ...etc.
      }

      while (1) {
          # read a REC_HEAD-sized chunk ($head) from the file
          my ($rec_len, $rec_type, ...) = unpack($REC_HEAD, $head);
          my $rec_body = read($rec_len);

          # a big switch on $rec_type
          if ($rec_type == FOO) {
              # unpack the record body for THIS record type. A FOO body
              # has 4 mandatory fixed fields:
              #     uSHORT, uLONG, uLONG, string (C/a)
              # followed by optional fields (Byte, Float, String, ...)
              my @data = unpack("$uSHORT $uLONG $uLONG C/a", $rec_body);
              my $consumed_length = 10 + length($data[-1]) + 1;
                  # ushort + 2*ulong + string + C/a count byte

              if ($consumed_length < $rec_len) {    # optional fields present
                  push @data, unpack("x$consumed_length C", $rec_body);
                  $consumed_length += 1;
              }
              if ($consumed_length < $rec_len) {
                  push @data, unpack("x$consumed_length $Float", $rec_body);
                  $consumed_length += 4;            # float is 4 bytes
              }
              # ...next optional, etc.
          }
          elsif ($rec_type == BAR) {
              ...
          }
      }
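      Since the optional fields can only be dropped from the end, the tail could also be handled with one extra unpack instead of one per field; a sketch continuing from the variables above, assuming the tail order (Byte, Float, String) used in the unpack calls, where only the String is variable length:

          # The remaining byte count tells us which optional fields exist.
          my $remaining = $rec_len - $consumed_length;
          my $tail_tmpl = $remaining == 0 ? undef
                        : $remaining == 1 ? 'C'             # Byte only
                        : $remaining == 5 ? "C $Float"      # Byte + 4-byte Float
                        :                   "C $Float C/a"; # all three
          push @data, unpack( "x$consumed_length $tail_tmpl", $rec_body )
              if defined $tail_tmpl;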
Re: Optimizing binary file parser (pack/unpack)
by Marshall (Abbot) on Oct 03, 2017 at 21:54 UTC
    It would be helpful if you gave more specs about this file.

    The last time I worked with a binary file in Perl, it was to concatenate some .WAV files together. A .WAV file has a header and then some number of binary bytes of data. The number of data bytes is specified in the header. It was not necessary for me to unpack all of the data, just the parts of the header relevant to the size of the data that followed, amongst other params. I selected the key parts of the binary header via substr to get ranges of bytes and used pack/unpack on them.
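    Roughly like this (not my original code; a minimal sketch against the standard 12-byte RIFF/WAVE header):

        use strict;
        use warnings;

        # Read just the 12-byte RIFF header; check and extract fields
        # with substr + unpack, without touching the audio data.
        open my $fh, '<:raw', 'in.wav' or die "in.wav: $!";
        read( $fh, my $hdr, 12 ) == 12 or die "short read";

        die "not a RIFF file" unless substr( $hdr, 0, 4 ) eq 'RIFF';
        die "not a WAVE file" unless substr( $hdr, 8, 4 ) eq 'WAVE';
        my $riff_size = unpack 'V', substr( $hdr, 4, 4 );  # u32, little-endian
        print "RIFF chunk size: $riff_size bytes\n";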

    I think BrowserUk is on the right track here Re: Optimizing binary file parser (pack/unpack).
