Monk::Thomas has asked for the wisdom of the Perl Monks concerning the following question:
Hello fellow monks
I wrote a parser library for a specific class of binary files (resource files for a video game). It converts the file into a human readable data structure. (hashes of hashes of array of hashes kinda thing; data fields that are only relevant for parsing the binary data streams are stripped from the result)
One of the data types it must be able to handle is 'flags': a variable-length sequence of bytes where the value as a whole is uninteresting. The interesting part is whether a certain bit (flag) is set or not, e.g. whether a record is deleted, compressed, or has a certain property. Most of them seem to be exactly 1, 2, 4 or 8 bytes long, so I could easily use an unsigned integer value. However, there are two things that bug me:
- what to do if a flags field turns up whose length does not match an integer type (e.g. 6 or 10 bytes)?
- is there a better way to represent the 'flag/bit'-nature of the value?
My ideas:
One could emulate a '6 Byte Flags' field by reading a uint32 plus a uint16 and then manually calculating the combined integer value. Did anyone say kludge/wart? Yeah. Looks like one.
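A minimal sketch of that kludge next to an alternative that needs no integer at all. It assumes the field is stored little-endian with bit 0 in the first byte (the post does not say so), and `$raw` is made-up sample data. Perl's built-in `vec` can test single bits of a byte string of any length:

```perl
use strict;
use warnings;

# Made-up 6-byte flags field; assumed little-endian, bits 2, 6 and 40 set.
my $raw = pack 'C6', 0x44, 0x00, 0x00, 0x00, 0x00, 0x01;

# Route 1: read uint32 + uint16 and combine them by hand (the kludge).
my ($lo, $hi) = unpack 'V v', $raw;
my $value = $hi * 2**32 + $lo;

# Route 2: skip the integer and ask for single bits with vec().
# vec($raw, $n, 1) returns bit $n, counting from the lowest bit of byte 0,
# so the same code works for 6, 10 or any other number of bytes.
my @set_bits = grep { vec($raw, $_, 1) } 0 .. 8 * length($raw) - 1;

print "value: $value\n";
print "set bits: @set_bits\n";
```

Route 2 never builds a number, so the field length stops mattering; Route 1 only works while the combined value still fits the platform's integer (or floating point) precision.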
Other representations I can think of could be 1110111100001 (which could get _extremely_ long) or a hash like:
(unknown bits with value 0 are not listed in order to conserve space)
    %flags = (
        'is_deleted'    => 0, # a known flag
        'is_compressed' => 1, # another known flag
        '2^15'          => 1, # a bit that is set but unknown
    );
Your ideas?
Thanks for all your input! I'm having a bit of trouble deciding whether I should go with
    flags => 0b00100010000...
    #            | is_deleted
    #                | is_compressed
or with
    %flags = (
        'is_deleted'    => 0, # a known flag
        'is_compressed' => 1, # another known flag
        '2^15'          => 1, # a bit that is set but unknown
    );
because both are quite nice. I'm going to try both and see what works best. =) Regarding the parser grammar it becomes obvious that I need a custom data type for flag-fields. Maybe something like:
    example:
      # name      specification      expected value
      - [ Flags,  'flags_example',                   ]

    flags_example: {
      "length": 4,             # length of flags field (in bits or bytes)
      "2^2": "is_deleted",     # a known flag
      "2^6": "is_compressed",  # another known flag
      ...
    }
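A rough sketch of how such a specification could drive the decoding into the hash form. The `%spec` hash and the `decode_flags` helper are hypothetical names invented for this illustration, `length` is taken to mean bytes, and bit 0 is assumed to be the lowest bit of the first byte:

```perl
use strict;
use warnings;

# Hypothetical spec mirroring flags_example; 'length' is assumed to be bytes.
my %spec = (
    length => 4,
    '2^2'  => 'is_deleted',
    '2^6'  => 'is_compressed',
);

# Decode a raw flags field into a hash: known flags always appear (0 or 1),
# unknown bits appear as '2^N' only when they are set.
sub decode_flags {
    my ($raw, $spec) = @_;
    die "expected $spec->{length} bytes\n"
        unless length($raw) == $spec->{length};

    my %flags;
    for my $bit (0 .. 8 * length($raw) - 1) {
        my $is_set = vec($raw, $bit, 1);
        my $name   = $spec->{"2^$bit"};
        if    (defined $name) { $flags{$name}    = $is_set }
        elsif ($is_set)       { $flags{"2^$bit"} = 1 }
    }
    return \%flags;
}

# Sample data: bits 6 and 15 set, bit 2 clear.
my $flags = decode_flags(pack('V', (1 << 6) | (1 << 15)), \%spec);
print "$_ => $flags->{$_}\n" for sort keys %$flags;
```

Because the lookup is per bit, the same loop would handle a 6 or 10 byte field without any change.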
context
The parser must be able to parse about 120 different 'records'. Since I don't want to hardcode all the different formats, the parser is configured by a YAML file. A full record description is probably kinda boring, so here is the hex dump for a value, the parser grammar and the actual parsed data:
hex dump:
    4B 53 49 5A 04 00 03 00 00 00
    4B 57 44 41 0C 00 98 37 01 00 95 37 01 00 6C 2A 09 00
annotated hex dump:
    4B 53 49 5A                            Type (KSIZ)
    04 00                                  Size (always 4)
    03 00 00 00                            KwrdCount
    4B 57 44 41                            Type (KWDA)
    0C 00                                  Size (4 * KwrdCount)
    98 37 01 00 95 37 01 00 6C 2A 09 00    Keywords FormID{count}
parser grammar:
    example:
      # name         specification       expected value
      - [ type1,     'char[4]',          'x = KSIZ'       ]
      - [ size1,     'uint[2]',          'x = 4', 's = 2' ]
      - << size1 begin >>
      - [ count,     'uint[size1]',      'x > 0'          ]
      - << size1 end >>
      # -------------------------------------------------------------- #
      - [ type2,     'char[4]',          'x = KWDA'       ]
      - [ size2,     'uint[2]',                           ]
      - << size2 begin >>
      - [ Keywords,  'uint[4]{count}',   'c > 0'          ]
      - << size2 end >>
combining hex dump + grammar results in:
    ...
    example => {
      Keywords => [ '98 37 01 00', '95 37 01 00', '6C 2A 09 00' ],
    }
    ...
(The output is a bit fudged, because Keywords => [] would actually contain the integer values. But then there would be nothing left resembling the original data, so I left the raw hex dump values.)
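To make the mapping from bytes to values concrete, here is a hand-written sketch that reads the same 28 bytes with plain `unpack`. It is not the grammar-driven parser, just the hard-coded equivalent for this one record, and it assumes little-endian integers (which the `04 00` and `0C 00` size fields suggest):

```perl
use strict;
use warnings;

# The 28 bytes from the hex dump above.
my $data = pack 'H*', join '', qw(
    4B53495A 0400 03000000
    4B574441 0C00 98370100 95370100 6C2A0900
);

# First sub-record: 4-char type, uint16 size, then the uint32 count.
my ($type1, $size1, $count) = unpack 'a4 v V', $data;

# Second sub-record: skip the first 10 bytes, then read the type, the size
# and 'count' little-endian uint32 keyword values.
my ($type2, $size2, @keywords) = unpack "x10 a4 v V$count", $data;

print "$type1: size=$size1 count=$count\n";
print "$type2: size=$size2 keywords=",
    join(' ', map { sprintf '0x%08X', $_ } @keywords), "\n";
```

Here `@keywords` holds the integer values that the real result set would contain instead of the raw hex strings shown above.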
How to read the parser grammar:
- This parser grammar is written in YAML. (Actually the only reason for YAML is the ability to use comments. Strip the comments and it's JSON.)
- lines beginning with # are comments and are only provided for documentation purposes
- lines beginning with - indicate a parseable item
- square-bracketed lines indicate a value to read from the data stream. The first column is a suitable value name, the second is the actual binary data format, and the third is optional and (if present) denotes one or more conditions that must be met in order for the value to be valid.
- value names matching qr/^[a-z\d]+$/ (all lowercase letters and digits) are relevant only during parsing and are not part of the final result set. If the parsed data needs to be serialized into a data stream again, these values are either calculated from the input value (3 'Keywords' => count=3) or, if they are required to be a certain value, taken from the 'expected value' column (type1='KSIZ')
- all other values are part of the returned parser result.
- '<< (\w+) (begin|end) >>' mark the beginning and end of the span that is used for calculating the relevant size value. (nasty: The length of 'size' itself may be a part of the actual value.)
not shown: sub records, alternatives, repeating records, ...
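The serialization rule for parse-only values can be illustrated with a small sketch that rebuilds the example record from nothing but the `Keywords` array. Everything here is hard-coded for this one record and assumes little-endian integers; a real implementation would derive the steps from the grammar:

```perl
use strict;
use warnings;

# Parsed result as the caller would see it (integer keyword values).
my %record = ( Keywords => [ 0x00013798, 0x00013795, 0x00092A6C ] );

# Parse-only values are rebuilt: count from the array, size2 from the payload,
# fixed values (type1, size1, type2) from the 'expected value' column.
my $count   = @{ $record{Keywords} };
my $payload = pack 'V*', @{ $record{Keywords} };
my $bytes   = pack('a4 v V', 'KSIZ', 4, $count)
            . pack('a4 v', 'KWDA', length($payload))
            . $payload;

print join(' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes), "\n";
```

The printed bytes can be compared against the hex dump to check the round trip.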
I'm pretty sure this library will end up on CPAN some day. For now I want to keep it private to be able to modify the API (and break backwards compatibility) at will. (And defer finding a suitable name until it's ready for submitting. Current name is File::Parse)