http://www.perlmonks.org?node_id=1133633

Monk::Thomas has asked for the wisdom of the Perl Monks concerning the following question:

Hello fellow monks

I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human-readable data structure (a hashes-of-hashes-of-arrays-of-hashes kind of thing); data fields that are only relevant for parsing the binary data streams are stripped from the result.

One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the actual numeric value is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. The fields seem to be mostly exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are 2 things that bug me:

My ideas:

One could emulate a '6 Byte Flags' field by reading a uint32 + a uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
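For reference, a minimal sketch of that uint32 + uint16 combination. It assumes the field is stored little-endian (as the size fields in the hex dump further down appear to be) and a perl built with 64-bit integers; the helper name read_flags48 is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: combine a 6-byte little-endian flags field into
# one integer. Assumes a perl with 64-bit integer support.
sub read_flags48 {
    my ($bytes) = @_;
    die "need exactly 6 bytes\n" unless length($bytes) == 6;
    my ($low32, $high16) = unpack 'V v', $bytes;   # uint32 LE, uint16 LE
    return $low32 | ($high16 << 32);
}

my $value = read_flags48("\x01\x00\x00\x00\x02\x00");
printf "0x%012X\n", $value;    # bit 0 and bit 33 are set
```

It works, but every odd field length would need its own combination of unpack templates, which is the kludge part.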

Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:

%flags = (
  'is_deleted'    => 0, # a known flag
  'is_compressed' => 1, # another known flag
  '2^15'          => 1, # a bit that is set but unknown
);
(unknown bits with value 0 are not listed in order to conserve space)
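A rough sketch of how such a hash could be built for a field of any length. It assumes little-endian byte order, so that bit n of the field has the value 2^n; the helper decode_flags and the bit-number-to-name map are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a flags field of any length into a hash.
# Assumes little-endian byte order, so bit n of the field has value 2^n.
sub decode_flags {
    my ($bytes, $names) = @_;          # $names: { bit number => flag name }
    my @bits = split //, unpack 'b*', $bytes;   # 'b*' yields bit 0 first

    my %flags;
    for my $n (0 .. $#bits) {
        if (exists $names->{$n}) {
            $flags{ $names->{$n} } = $bits[$n];   # known flag: always listed
        }
        elsif ($bits[$n]) {
            $flags{"2^$n"} = 1;                   # unknown flag: only if set
        }
    }
    return \%flags;
}

my $flags = decode_flags("\x40\x80", { 2 => 'is_deleted', 6 => 'is_compressed' });
# $flags is { is_deleted => 0, is_compressed => 1, '2^15' => 1 }
```

Because unpack 'b*' works on the raw byte string, this does not care whether the field is 1, 6 or 23 bytes long and never needs a 64-bit integer.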

Your ideas?

Thanks for all your input! I have a bit of trouble deciding whether I should go with

%flags = (
  'is_deleted'    => 0, # a known flag
  'is_compressed' => 1, # another known flag
  '2^15'          => 1, # a bit that is set but unknown
);
or with
  flags => 0b00100010000...
  #            | is_deleted
  #                | is_compressed

because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:

example:
  # name        specification       expected value
  - [ Flags,    'flags_example',                     ]

flags_example: {
  "length": 4,              # length of flags field (in bits or bytes)
  "2^2": "is_deleted",      # a known flag
  "2^6": "is_compressed",   # another known flag
  ...
}
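A spec of this shape would also carry enough information to serialize a parsed flags hash back into bytes. A sketch of that direction, assuming "length" is given in bytes and the field is little-endian; encode_flags and the %spec hash are hypothetical stand-ins for whatever the parser would load from the YAML file.

```perl
use strict;
use warnings;

# Hypothetical spec, mirroring the flags_example entry; "length" is
# assumed to be in bytes here.
my %spec = (
    'length' => 4,
    '2^2'    => 'is_deleted',
    '2^6'    => 'is_compressed',
);

# Hypothetical helper: build the binary field from a parsed flags hash.
sub encode_flags {
    my ($flags, $spec) = @_;

    # Invert the spec: flag name => bit number.
    my %bit_of;
    for my $key (keys %$spec) {
        $bit_of{ $spec->{$key} } = $1 if $key =~ /^2\^(\d+)$/;
    }

    my @bits = (0) x ($spec->{length} * 8);
    for my $key (keys %$flags) {
        my $n = exists $bit_of{$key} ? $bit_of{$key}
              : $key =~ /^2\^(\d+)$/ ? $1
              : die "unknown flag '$key'\n";
        die "bit $n does not fit into $spec->{length} bytes\n" if $n > $#bits;
        $bits[$n] = $flags->{$key} ? 1 : 0;
    }
    return pack 'b*', join '', @bits;   # bit 0 first, i.e. little-endian
}

my $bytes = encode_flags({ is_deleted => 0, is_compressed => 1, '2^15' => 1 }, \%spec);
print uc(join ' ', unpack '(H2)*', $bytes), "\n";   # 40 80 00 00
```

Unknown bits survive the round trip because they are addressed by their '2^N' key rather than by name.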

context

The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats the parser is configurable by a YAML-file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:

hex dump:

4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00

annotated hex dump:
 4B 53 49 5A                           Type           (KSIZ)
 04 00                                 Size           (always 4)
 03 00 00 00                           KwrdCount
 4B 57 44 41                           Type           (KWDA)
 0C 00                                 Size           (4 * KwrdCount)
 98 37 01 00 95 37 01 00 6C 2A 09 00   Keywords       FormID{count}
parser grammar:
example:
  # name        specification       expected value
  - [ type1,    'char[4]',          'x = KSIZ'       ]
  - [ size1,    'uint[2]',          'x = 4', 's = 2' ]
  - << size1 begin >>
  - [ count,    'uint[size1]',      'x > 0'          ]
  - << size1 end >>
  # -------------------------------------------------------------- #
  - [ type2,    'char[4]',          'x = KWDA'       ]
  - [ size2,    'uint[2]',                           ]
  - << size2 begin >>
  - [ Keywords, 'uint[4]{count}',   'c > 0'          ]
  - << size2 end >>

combining hex dump + grammar results in:

...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
...    
(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
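To make the mapping from bytes to values concrete, here is a hard-coded sketch of what the grammar above describes for this one record. It is not how the configurable parser works; it only assumes little-endian integers, which is what the annotated dump suggests.

```perl
use strict;
use warnings;

# The hex dump from above, converted back into raw bytes.
my $hex = '4B 53 49 5A 04 00 03 00 00 00 4B 57 44 41 0C 00 '
        . '98 37 01 00 95 37 01 00 6C 2A 09 00';
(my $packed = $hex) =~ s/\s+//g;
my $data = pack 'H*', $packed;

# Hard-coded equivalent of the grammar: a4 = char[4], v = uint[2],
# V = uint[4], all little-endian (an assumption based on the dump).
my ($type1, $size1, $count, $type2, $size2, @keywords)
    = unpack 'a4 v V a4 v V*', $data;

# The 'expected value' column, checked by hand.
die "unexpected header\n"
    unless $type1 eq 'KSIZ' && $size1 == 4 && $type2 eq 'KWDA';
die "size/count mismatch\n"
    unless $size2 == 4 * $count && @keywords == $count;

printf "%s: %s\n", 'Keywords', join ', ', map { sprintf '0x%08X', $_ } @keywords;
# prints: Keywords: 0x00013798, 0x00013795, 0x00092A6C
```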

How to read the parser grammar:

  • This parser grammar is written in YAML. (Actually the only reason for YAML is the ability to use comments. Strip the comments and it's JSON.)
  • Lines beginning with # are comments and are only provided for documentation purposes.
  • Lines beginning with - indicate a parseable item.
  • Square-bracketed lines indicate a value to read from the data stream. The first column is a suitable value name, the second is the actual binary data format, and the third is optional and (if present) denotes one or more conditions that must be met in order for the value to be valid.
  • value names matching qr/[a-z\d]+/ are relevant only during parsing and are not part of the final result set. If the parsed data needs to be serialized into a data stream again, then these values are either calculated from the input value (3 'Keywords' => count=3) or if they are required to be a certain value they can be taken from the 'expected value' column (type1='KSIZ')
  • all other values are part of the returned parser result.
  • '<< (\w+) (begin|end)>>' signify the begin and end for calculating the relevant size value. (nasty: The length of 'size' itself may be a part of the actual value.)

not shown: sub records, alternatives, repeating records, ...

I'm pretty sure this library will end up on CPAN some day, for now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)

Replies are listed 'Best First'.
Re: Looking for ideas: Converting a binary 'flags' field into something human readable
by BrowserUk (Patriarch) on Jul 07, 2015 at 21:39 UTC
    Your ideas?

    If 8 bytes is the longest field, I think I'd be tempted to display the binary and annotate only those known fields something like this:

    flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
    #            | compressed
    #               | deleted
    #                | this
    #                     | that
    #                           | other
    #                             | something else
    #                                       | and another
    #                                                  foo |
    #                                                    bar |
    #                                                                     up |
    #                                                                    down |
    #                                                                 sideways |

    Not pretty, but very clear.
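    A sketch of how the simple one-label-per-line variant could be generated. The helper annotate_flags is invented for this example, bit positions are counted from the left-most digit (an arbitrary choice), and set bits without a label are simply marked 'unknown'.

```perl
use strict;
use warnings;

# Hypothetical helper: render a flags value as a binary literal followed by
# one comment line per set bit. Positions count from the left-most digit
# (0 = first digit), which is an arbitrary choice for this sketch.
sub annotate_flags {
    my ($name, $bitstring, $labels) = @_;   # $labels: { position => text }
    my $prefix = "$name => 0b";
    my @lines  = ("$prefix$bitstring;");

    for my $pos (0 .. length($bitstring) - 1) {
        next unless substr($bitstring, $pos, 1) eq '1';
        my $label = exists $labels->{$pos} ? $labels->{$pos} : 'unknown';
        # '#' takes column 0, so pad up to the column of this digit.
        push @lines, '#' . ' ' x (length($prefix) + $pos - 1) . "| $label";
    }
    return join "\n", @lines;
}

print annotate_flags('flags', '00100010',
                     { 2 => 'is_deleted', 6 => 'is_compressed' }), "\n";
```

    The call at the end prints the binary literal and then one comment line per set bit, with each '|' sitting directly under its digit.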


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
    I'm with torvalds on this Agile (and TDD) debunked I told'em LLVM was the way to go. But did they listen!

      Or?

      flags1 => 0b0100110000100000101000000000100000000000000101000000000000000111;
      #            |  ||    |that | |         |and another   | |bar          up|||
      #                |this        |something else          |foo           down|
      #               |deleted    | other                                sideways|
      #            |compressed

        I'd like to see the code that determines how to compress those together :)


Re: Looking for ideas: Converting a binary 'flags' field into something human readable
by bitingduck (Chaplain) on Jul 10, 2015 at 00:37 UTC

    You've already got some code posted and a few suggestions, but when I had to do this a few months ago for a known number of bits, I used a D/A board and LEDs.

    :D