
Thoughts on designing a file format.

by demerphq (Chancellor)
on Sep 12, 2005 at 17:34 UTC

Today I had to design a file format. I've had to do this a bunch of times in my career, and I think I've finally got a grip on it. Here are some of the things that I think are worthy of consideration.

BTW, this is written with flat, separated-value type files in mind. No doubt an XML file needs to carry similar data, but I'm looking at this through xSV glasses right now.

Header row
Each file should have a header row. The header row should contain data about the file as a whole. Thus it might include
  • The creation date
  • Who or what it was created by
  • Version information about the file's contents. E.g. if your file is going to hold valid IP addresses and there may be an updated file with the same name, this field can be used to tell them apart.
  • File format version data. It should be possible to uniquely identify the version of the file format being read from the header. Even if you don't think there will be a format change in the future, keep this in mind. Even something like making the first field of the header a literal 'HDR' is flexible enough that the readers can be rewritten to handle a later 'HD2' type header, and so on.
As a general rule the header should contain sufficient information that the filename can be regenerated from it. (Which is very useful if somebody renames the files for some reason)
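A minimal sketch of reading such a header. The comma-separated layout, the field order, the names `parse_header` and `filename_from_header`, and the `<feed>_<date>_v<version>.csv` naming convention are all assumptions made up for illustration; the point is only that the first field selects the parser and that the header alone is enough to rebuild the filename.

```perl
use strict;
use warnings;

# Hypothetical header layout (an assumption for illustration only):
#   HDR,<format version>,<created, ISO>,<creator>,<feed name>,<data version>
my %header_parsers = (
    HDR => sub {
        my @f = @_;
        return {
            format_version => $f[0],
            created        => $f[1],
            creator        => $f[2],
            feed           => $f[3],
            data_version   => $f[4],
        };
    },
    # A later 'HD2' header would get its own entry here.
);

sub parse_header {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;                       # tolerate CR-LF or LF
    my ($tag, @fields) = split /,/, $line, -1;
    my $parser = defined $tag ? $header_parsers{$tag} : undef;
    die "Unknown header type in line: $line\n" unless $parser;
    return $parser->(@fields);
}

# Rebuild the (hypothetical) filename from header fields alone.
sub filename_from_header {
    my ($h) = @_;
    my ($date) = $h->{created} =~ /^(\d{4}-\d{2}-\d{2})/
        or die "Header has no ISO creation date\n";
    (my $compact = $date) =~ tr/-//d;           # 2005-09-12 becomes 20050912
    return sprintf '%s_%s_v%s.csv', $h->{feed}, $compact, $h->{data_version};
}

my $h = parse_header("HDR,1,2005-09-12 17:34:00,feedgen,valid_ips,3\r\n");
print filename_from_header($h), "\n";           # valid_ips_20050912_v3.csv
```

Supporting a new header version then means adding one entry to the dispatch table rather than touching the existing parser.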
Footer row
Each file should have a footer row. The footer should be suitable for determining if the file has been corrupted or damaged. It should be easy to tell apart from a data record. A simple trailer might be the fields ROWS,1234, which would show how many records are in the file. Be sure you specify if the count includes headers and footers or not.
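A rough sketch of the check a reader might do with such a trailer. It assumes the first line is a header starting with 'HDR', the last line is 'ROWS,<n>', and <n> counts data records only; `check_row_count` is a made-up name, and a real spec would have to state the counting rule explicitly.

```perl
use strict;
use warnings;

# Returns 1 if the trailer's claimed count matches the data records seen,
# 0 if the header or trailer is missing or the count is off.
sub check_row_count {
    my @lines = @_;                  # work on copies of the caller's lines
    s/\r?\n\z// for @lines;
    return 0 unless @lines >= 2;
    my $header  = shift @lines;
    my $trailer = pop @lines;
    return 0 unless $header =~ /^HDR,/;
    my ($claimed) = $trailer =~ /^ROWS,(\d+)\z/
        or return 0;                 # no trailer: file was probably truncated
    return $claimed == @lines ? 1 : 0;
}

my @good = ("HDR,1,2005-09-12\n", "DAT,a\n", "DAT,b\n", "ROWS,2\n");
my @cut  = @good[0 .. 1];            # simulate a transfer that died early

print check_row_count(@good) ? "ok\n" : "damaged\n";   # ok
print check_row_count(@cut)  ? "ok\n" : "damaged\n";   # damaged
```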
Creator-Reserved Fields
Each data record in the file should contain a field that is used strictly by the _creator_ of the file. Any processors should be designed to ignore, but preserve, this field. This can come in very useful when you are trying to reconcile two feeds. So as an example, if you are providing a data delivery service from multiple inbound feeds you should provide a way for those feeds to "tag" each record in a way that suits them. When the time comes to work out why the data being delivered to the end user doesn't match the data provided to the aggregator, these fields become critical for reconciliation.
Data records should be easily identified
Each record type in the file should be easily identified from the others. Using the first field as an indicator works well.
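With the type in the first field, a reader can be little more than a dispatch table. This is only a sketch: the tags 'HDR', 'DAT' and 'ROWS', the comma separator and the name `handle_record` are assumptions chosen to match the examples above.

```perl
use strict;
use warnings;

# Hypothetical record types; the tags and field layouts are assumptions.
my %handler = (
    HDR  => sub { return "header v$_[0]" },
    DAT  => sub { return "data for $_[0]" },
    ROWS => sub { return "trailer claiming $_[0] rows" },
);

sub handle_record {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    my ($type, @fields) = split /,/, $line, -1;
    my $code = defined $type ? $handler{$type} : undef;
    # Failing loudly beats silently misparsing a record type nobody told us about.
    die "Unknown record type in line: $line\n" unless $code;
    return $code->(@fields);
}

print handle_record($_), "\n" for "HDR,1", "DAT,10.0.0.1", "ROWS,1";
```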
Uniform, sortable date stamps.
Do yourself a favour: drop whatever allegiance you have to your nation's preferred date format and use something standard like an ISO-style date stamp (i.e. YYYY-MM-DD HH:MM:SS). They are sortable, scalable, easily read, and utterly unambiguous. ALL of the other date formats (especially ones using two-digit years) suffer from serious problems of ambiguity. YOU may know that the dates in the file are MM/DD/YY, but the German intern who is trying to parse your output very likely will not. While some would say that the specification is there to avoid these problems, I don't agree. Eventually you will get bitten anyway. With an ISO datestamp it's unlikely that you ever will.
Specify separators explicitly.
Do not say "all files are CSV files" if you mean "all files are MS-Excel compatible CSV format". Do not say "file is in CSV format". Say "fields in the file will be separated by commas" and specify the hex ASCII code for the separator. Consider using other separators as suits your data. Tab-separated values have some advantages, for instance.
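A short sketch of why this format sorts so well, using only the core POSIX module. The helper name `iso_stamp` and the choice of UTC are assumptions for the example; the epoch values are arbitrary.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Format an epoch time as 'YYYY-MM-DD HH:MM:SS' in UTC.
sub iso_stamp {
    my ($epoch) = @_;
    return strftime('%Y-%m-%d %H:%M:%S', gmtime $epoch);
}

my @stamps = map { iso_stamp($_) } (1126546440, 0, 86400 * 365);

# The fields run from most to least significant and are zero padded,
# so a plain string sort is also a chronological sort.
print "$_\n" for sort @stamps;
```

No date parsing is needed to order the records, which is not true of MM/DD/YY or DD.MM.YYYY.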
Specify line endings explicitly
Do not assume the reader will be using the line endings you are used to. Explicitly specify them. I prefer to use network line endings "CR-LF".
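One way to keep code and spec in agreement is to name the separator and line ending by their hex codes in the code too. The sketch below assumes comma (0x2C) and CR-LF (0x0D 0x0A); `write_records` and `read_records` are hypothetical helpers, and fields containing the separator are deliberately not handled.

```perl
use strict;
use warnings;

my $FIELD_SEP  = "\x2C";        # comma, ASCII 0x2C
my $RECORD_END = "\x0D\x0A";    # CR-LF, ASCII 0x0D 0x0A

sub write_records {
    my ($fh, @records) = @_;
    binmode $fh;    # stop the I/O layer translating "\n" on some platforms
    print {$fh} join($FIELD_SEP, @$_), $RECORD_END for @records;
}

sub read_records {
    my ($fh) = @_;
    binmode $fh;
    local $/ = $RECORD_END;     # read up to the documented record ending
    my @records;
    while (my $line = <$fh>) {
        chomp $line;            # removes exactly the current value of $/
        push @records, [ split /\Q$FIELD_SEP\E/, $line, -1 ];
    }
    return @records;
}

# Round trip through an in-memory file.
my $buffer = '';
open my $out, '>', \$buffer or die "open: $!";
write_records($out, [ 'DAT', 'a', '' ], [ 'DAT', 'b', 'c' ]);
close $out;

open my $in, '<', \$buffer or die "open: $!";
my @back = read_records($in);
close $in;

printf "%d records, %d bytes\n", scalar @back, length $buffer;
```

Because both sides set the record ending explicitly, the same bytes come out whichever OS the writer ran on.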
Put a document version number and date on the document
A lot of people leave this off, and it usually causes problems when they do.
Put your name on the document.
Be proud of it. Don't take your name off because that is company policy without a serious discussion of the policy. People should know who to talk to for clarification or changes.
File naming convention
The specification should include information about how the file should be named. It should be possible to reproduce the filename from the header. (Although not necessarily the opposite.)
Specify data types
Each field in the file should have its type described in an easily read way. Everyone from business types to programmers will be reading it, so try to provide sufficient detail that all of them can understand what is going on. Providing examples of field contents is a good idea, but can be tricky, as a mistake in the sample can really confuse things.
Specify numeric types carefully.
This includes specifying if thousands separators are to be used or if the decimal point will be a '.' or some other char. Consider using a regular expression in the documentation to denote specifically what numbers should look like. Do not assume that the company you are dealing with will necessarily have the same numeric formats as you do.
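As an illustration of a regular expression doing the job of the spec, here is a sketch for one made-up rule: optional minus sign, no thousands separators, '.' as the decimal point, exactly two decimal places. The rule and the name `$AMOUNT_RE` are assumptions; the real pattern depends on what your format allows.

```perl
use strict;
use warnings;

# Assumed rule for this sketch: optional minus sign, no thousands separators,
# '.' as the decimal point, exactly two decimal places, no leading zeros.
my $AMOUNT_RE = qr/\A-?(?:0|[1-9][0-9]*)\.[0-9]{2}\z/;

for my $candidate ('1234.50', '-0.05', '1,234.50', '1234,50', '12.5') {
    printf "%-10s %s\n", $candidate,
        $candidate =~ $AMOUNT_RE ? 'valid' : 'rejected';
}
```

The first two candidates are accepted and the last three rejected, which settles questions like "is 1,234.50 allowed?" more reliably than a paragraph of prose.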
If speed is an issue consider using fixed width fields.
Using pack and unpack this can be an extremely efficient way to read the records, and has the somewhat useful property that you can tell how many records are in a file by its size.
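A small sketch of the idea, assuming a hypothetical 28-byte layout (3-byte record type, 15-byte IP address, 10-byte date, space padded, no line endings). An in-memory string stands in for the file; with a real file the size would come from `-s $path`.

```perl
use strict;
use warnings;

# Hypothetical layout: 'A' fields are space padded on pack and have
# trailing spaces stripped on unpack.
my $TEMPLATE   = 'A3 A15 A10';
my $RECORD_LEN = length(pack($TEMPLATE, '', '', ''));    # 28 bytes

my $data = join '',
    pack($TEMPLATE, 'DAT', '10.0.0.1',     '2005-09-12'),
    pack($TEMPLATE, 'DAT', '192.168.1.20', '2005-09-13');

# The record count follows directly from the size.
die "Truncated file\n" if length($data) % $RECORD_LEN;
my $count = length($data) / $RECORD_LEN;

open my $fh, '<', \$data or die "open: $!";
binmode $fh;
my @records;
while (read($fh, my $buf, $RECORD_LEN)) {
    push @records, [ unpack $TEMPLATE, $buf ];
}
close $fh;

print "$count records; second IP is $records[1][1]\n";
```

A size that is not a multiple of the record length is also a cheap truncation check.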

No doubt I've missed stuff out here, or have stuff in here you disagree with. Please let me know what. :-)

Update: Added a few things that occurred to me that others haven't mentioned yet.


Re: Thoughts on designing a file format.
by Corion (Pope) on Sep 12, 2005 at 17:40 UTC

    As a side issue regarding file names, there is one essential rule that I've found to be very important with systems that generate reports:

    All timestamps in the filename must be the date for which the report was run, and not the time of creation of the report.

    Following this rule makes it easy and convenient for your consuming program to pick up processing the data whenever a system delivers its file "later" than expected, and keeps yourself sane whenever a system misses the 0:01am deadline...

Re: Thoughts on designing a file format.
by ww (Bishop) on Sep 12, 2005 at 17:42 UTC
    Would you consider it worthwhile to add a field (or set of fields) for an update/edit history, incorporating both date of the (edit/update) and the author of same?

    An impractical (but in some ways desirable) addition would be a field or set of fields to hold sequential diffs for each edit. Notion is that it would amount to an internal cv repository, which would afford a subsequent editor/reader some hints about whether (for example) writer B's remark was in the same thread of changes as writer A's (if not, it might be merely similar or tangential, whereas if in same_thread, might be a child of A's).

    Update: I view including diffs as "impractical" because doing so could bloat a file worse than M$ does (mucho header/file info, very little unique content), to say nothing of the complexities (what do you do about the reader/revisor who's not running on a compatible OS or using a compliant editor?)

      Well, I would view that as either a header issue, that is on the file level, or as a data record specification issue. I should say though that I tried to stay away from the data as a whole, as I'm more concerned with the structure of the container and not so much with what it contains.

      In other words, in some situations support for change records is required, but how it's implemented would, I think, require a lot of contextual analysis that can't be broadly generalized. Please feel free to outline your thoughts on the subject, though; it's not really something that's come up for me regularly, in fact only once really.


Re: Thoughts on designing a file format.
by gargle (Hermit) on Sep 12, 2005 at 19:41 UTC


    A lovely node! My years as a cobol programmer can come in handy for treating batch programs :) Just a few remarks:

    • In the header: put the name of the program creating the file, a date and time of creation, and a date and time of modification. Also put the filename in the header! The point is that the header identifies the file. So you need info about which program created/modified it and when. You can even put in extra info about the programs that are expected to modify it next, or that treated the file before the current program had a go at it!
    • In the trailer: put the total number of records (header and trailer inclusive). Put a total (or an md5 total) for the most important fields in your data. Include the number of data records. If you decide to keep the creation date/time in the header, put your date and time of modification here.
    • Just as you use a header and trailer for the complete file, use separate headers and trailers for blocks of data in your file. Your data records should be identifiable by 0 for a header, 1 for real data and 9 for a trailer of a data block.

    You'll end up with:


    Some comments:

    name program creating the file
    This allows you to check if the next program treating the file is the correct one. You can control the sequence of treatment by issuing a die if the programs run out of sequence. To make this even better, add a second field naming the program that will process the file next.
    trailer info
    Reading the complete file and adding subtotals allows you to check the trailer for modifications in the file.
    data block trailer
    Reading the datablock and adding subtotals allows you to check the data block for integrity

    Of course, all other notes of the OP count as well (however, thinking in cobol makes me prefer fixed record lengths)

    More info: Jackson

    if ( 1 ) { $postman->ring() for (1..2); }
Re: Thoughts on designing a file format.
by exussum0 (Vicar) on Sep 12, 2005 at 21:22 UTC
    Data records should be easily identified. Each record type in the file should be easily identified from the others. Using the first field as an indicator works well.

    I've worked with data similar to what you are talking about many times, and I cannot stress enough how important this is. The record type indicator should change between record versions as well. You can tie it to record-specific information such as what the delimiters are, or the record length.

    Record labels accomplish one key thing that trumps everything else you have said: you can write a parser for a single record w/o trampling over other record types. Without this, you will have no way to determine where data starts or ends between records. You would have to write some complex code to figure out record delimiters which may or may not be consistent.

    It gets ugly fast, especially if someone creates a new record format w/o telling you.

    Give me strength for today.. I will not talk it away.. Just for a moment..
    It will burn through the clouds.. and shine down on me.

Re: Thoughts on designing a file format.
by bsb (Priest) on Sep 13, 2005 at 02:45 UTC
    Nice post. I've learnt too many of these the hard way with inherited formats. In particular, one format lacked easily identifiable record types, so a record's subtype might depend on the first field, or on the first and second, with the length of the second depending on the first...

    Other thoughts:

    • Dependency ordered records, that is parents before their children.
    • Count, checksum or hash at the end (I think a line count gives a false sense of protection from corruption).
    • The two points above make file processing more streamable.
    • Consider specifying character encodings.
    • Don't be cryptic to save characters, have a plain text format and zip it.
    See also the Art of Unix Programming
Re: Thoughts on designing a file format.
by adamc00 (Initiate) on Sep 13, 2005 at 05:32 UTC
    We also work on this sort of stuff a lot, here are some additional thoughts.

    We dropped the requirement for a footer row in favour of an MD5 checksum since it is a better indication of file corruption than a footer count. Once you have an MD5 there's really no need for the count.
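    A minimal sketch of an MD5 check using the core Digest::MD5 module. Where the digest is stored is not stated above, so this sketch assumes a final line of the form "MD5,<32 hex digits>" covering every byte before it; the names `seal` and `is_intact` are made up for the example.

    ```perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    # Assumption: the digest covers every byte before the final line, and that
    # final line is "MD5,<32 hex digits>" terminated by CR-LF.
    sub seal {
        my ($body) = @_;
        return $body . 'MD5,' . md5_hex($body) . "\x0D\x0A";
    }

    sub is_intact {
        my ($content) = @_;
        my ($body, $digest) = $content =~ /\A(.*)MD5,([0-9a-f]{32})\x0D\x0A\z/s
            or return 0;                      # digest line missing or mangled
        return md5_hex($body) eq $digest ? 1 : 0;
    }

    my $file = seal("HDR,1\x0D\x0ADAT,a\x0D\x0A");
    (my $damaged = $file) =~ s/DAT,a/DAT,b/;  # same length, same row count

    print is_intact($file)    ? "intact\n" : "corrupt\n";   # intact
    print is_intact($damaged) ? "intact\n" : "corrupt\n";   # corrupt
    ```

    The damaged copy has the same number of rows as the original, so a row count alone would not have caught it.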

    > Data records should be easily identified
    > Each record type in the file should be easily identified
    > from the others. Using the first field as an indicator
    > works well.

    If you use fixed length records (and therefore fields), save yourself some heartache and make sure that all the record indicators are the same length. Yep, we've seen it done otherwise. When done this way, a simple chunk off the front of the record tells you what to expect; gymnastics are required otherwise.

    Also leave plenty of space, because at some point there might be variations on record types that are acceptable and a sub version can be handy.

    A final anecdote. Dates, how might I **** thee, let me count the ways.

    We were involved in rescuing 3 months of data that had been entered where on one of the workstations, and I quote... "Sometimes, on this one, the dates don't work. When that happens we just swap the day and month and it's OK.". Fan*******tastic.

Re: Thoughts on designing a file format.
by greenFox (Vicar) on Sep 13, 2005 at 06:15 UTC

    All good points ++

    I read a paper once that explained very clearly why the two character line endings (CR, LF) in DOS was a mistake, I have no idea where it was but Wikipedia echoes the sentiment. Either way documenting it is OK but using the line endings appropriate for your OS is a better approach.

    Two points I would add

    • Allow comments -my preference is for # comments
    • Ignore blank and whitespace only lines

    Hence any data file parsing I do usually ends up beginning like this-

    next if /^\s*#/; next if /^\s*$/;

    Murray Barton
    Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

      I prefer to use CR-LF because that is the standard network line ending, and because quite simply there will come a day when your file needs to be read by someone whose most advanced tool for reading it will be Excel. Likewise I tend to use csv so that cutting and pasting from the file to an Excel workbook works correctly, not to mention the fact that for the type of data I use embedded tabs are never a problem, but occasionally embedded commas are.


        I'm missing something here. On DOS if you write print FILE "some text\n"; you will get "\r\n" in the file. If you do the same thing on Unix you get just "\n". What are you outputting? Are you setting $INPUT_RECORD_SEPARATOR and $OUTPUT_RECORD_SEPARATOR to something other than the default? Otherwise chomp is going to break, for example: it will remove "\r\n" on DOS and just "\n" on Unix, leaving a "\r" at the end of every line. It seems like a lot of trouble to deal with something that ftp clients do automatically... if I copy your program and data file over to Unix I have to then change the line endings back to CR/LF before it works???

        Murray Barton
        Do not seek to follow in the footsteps of the wise. Seek what they sought. -Basho

      I read a paper once that explained very clearly why the two character line endings (CR, LF) in DOS was a mistake

      Now, let me explain why two-character line endings in DOS was *not* a mistake...

Re: Thoughts on designing a file format.
by leriksen (Curate) on Sep 13, 2005 at 23:43 UTC
    Something I have found useful for flat files, not hierarchical files, is to have the names of the columns as the first line.

    This helps avoid hardcoding the column names in your parsing/construction code. I'd expect the names will still be embedded in the code that utilises the constructed structure; after all, it's hard to escape the need to type "$row->{address}" when you need to access the address field (or $row->address() if you've built objects to give yourself one degree of separation).

    With the column names as the first row, you are insulated from the introduction of new columns until you really need them. For example, say you have columns 'name','street','town','country'. You have code to parse this line and create the appropriate accessors, and more code that reads the data and, for each line, returns an object that has these accessor methods. You then have code that uses these objects to build address labels or populate a database. Then one day the client who supplies these files adds a new field, say zip code, and it is in the data between the town and country columns. NONE of your code has to change until you're ready to use the zip code field. Your objects have an extra accessor, which is automatically created by the code that parses the column-description line, but nothing else is affected.

    It could be you want to add the zip code to the address labels, but you don't want to change the DB schema to capture zip codes. You change the label generation code, and leave the DB insert code alone. So your DB inserts continue to work, and your address labels now have zip codes.

    All because you were able to have the client give you one extra line in the supplied flat files.
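    A sketch of that insulation, using plain hashes keyed by column name rather than generated accessor objects (a simplification for brevity). The helper names `read_rows` and `label`, the sample data, and the assumption of plain comma separation with no quoting are all made up for the example.

    ```perl
    use strict;
    use warnings;

    # Turn a column-name line plus data lines into hashes keyed by column name.
    # Assumes plain comma separation with no embedded commas or quoting.
    sub read_rows {
        my ($header, @lines) = @_;
        s/\r?\n\z// for $header, @lines;
        my @columns = split /,/, $header, -1;
        my @rows;
        for my $line (@lines) {
            my %row;
            @row{@columns} = split /,/, $line, -1;   # hash slice
            push @rows, \%row;
        }
        return @rows;
    }

    # Written against the original four columns; it never mentions 'zip'.
    sub label {
        my ($row) = @_;
        return join ', ', @{$row}{qw(name street town country)};
    }

    my @old = read_rows("name,street,town,country\n",
                        "Ann,1 Main St,Perth,AU\n");
    my @new = read_rows("name,street,town,zip,country\n",
                        "Ann,1 Main St,Perth,6000,AU\n");

    print label($_), "\n" for @old, @new;   # same label both times
    ```

    The new column is available as $row->{zip} whenever the label code is ready for it, and ignored until then.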

    ...reality must take precedence over public relations, for nature cannot be fooled. - R P Feynman

      Field header rows can be useful indeed. However, they are only really applicable to files that contain only a single record type. Many of the files I deal with contain multiple record types. But it's a good point, thanks.


Re: Thoughts on designing a file format.
by radiantmatrix (Parson) on Sep 14, 2005 at 17:17 UTC

    I've done a number of file formats as well, and there are two pieces of advice I'd like to add to your excellent list:

    1. Explicitly specify your escape methodology: if you are creating a CSV file, how will a comma in the data be escaped?
    2. If possible, use record and unit separators that are unlikely to exist in your data: for example, I like to use the ASCII chars \x1E\x0A ("Record Separator" + newline) and \x1F ("Unit Separator") to separate records and elements, respectively. These are unlikely to appear in text data (unlike commas, tabs, etc.) and reduce the complexity of the escaping strategy that will be required.

    In many cases, combining these can result in "the record-separator and element-separator chars are not allowed in text data" as an escaping strategy. This means you can use code like:

    open my $F_data, '<', 'filename.dat' or die("bad open: $!");
    local $/ = "\x1E\x0A";   # input record separator ($\ would only affect print)
    while (<$F_data>) {
        chomp;               # strip the trailing record separator
        my @row = split("\x1F", $_);
        process(\@row);
    }
    Instead of relying on (admittedly excellent) modules like Text::CSV_XS. Using these chars tremendously simplifies one's life!

    Larry Wall is Yoda: there is no try{} (ok, except in Perl6; way to ruin a joke, Larry! ;P)
    The Code that can be seen is not the true Code
    "In any sufficiently large group of people, most are idiots" - Kaa's Law
