Today I had to design a file format. I've had to do this a bunch of times in my career, and I think I've finally got a grip on it. Here are some of the things that I think are worthy of consideration.
BTW, this is written generally with flat, separated-values type files in mind. No doubt an XML file needs to have similar data, but I'm looking at this through xSV glasses right now.
- Header row
- Each file should have a header row. The header row should contain data about the file as a whole. As a general rule the header should contain sufficient information that the filename can be regenerated from it (which is very useful if somebody renames the files for some reason). Thus it might include:
- The creation date
- Who or what it was created by
- Version information about the file. For example, if your file is going to hold valid IP addresses and there may be an updated file with the same name, this field can be used to tell them apart.
- File format version data. It should be possible to uniquely identify the version of the file format being read from the header. Even if you don't think there will be a format change in the future, keep this in mind. Even something like making the first field of the header a literal 'HDR' is flexible enough that the readers can be rewritten to handle a later 'HD2' type header, and so on.
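To make the 'HDR' / 'HD2' idea concrete, here is a minimal sketch of a reader that looks at the first field before deciding how to parse the rest. The field order (tag, creation date, creator, data version), the comma separator, and the names `Header` and `parse_header` are all my assumptions for illustration, not part of any real format.

```python
from dataclasses import dataclass


@dataclass
class Header:
    format_version: int
    created: str
    creator: str
    data_version: str


def parse_header(line: str, sep: str = ",") -> Header:
    """Parse a header row, dispatching on the literal tag in the first field.

    Assumed layout for this sketch: HDR,<created>,<creator>,<data version>
    """
    fields = line.rstrip("\r\n").split(sep)
    tag = fields[0]
    if tag == "HDR":
        _, created, creator, data_version = fields
        return Header(1, created, creator, data_version)
    if tag == "HD2":
        # A later format revision would get its own branch here.
        raise NotImplementedError("HD2 headers are not handled by this sketch")
    raise ValueError(f"unrecognised header tag: {tag!r}")
```

The point is only that the tag is checked first, so an old reader fails loudly on a newer header instead of misreading it.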
- Footer row
- Each file should have a footer row. The footer should be suitable for determining if the file has been corrupted or damaged. It should be easy to tell apart from a data record. A simple trailer might be the fields ROWS,1234, which would show how many records are in the file. Be sure you specify if the count includes headers and footers or not.
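A reader can use a trailer like that to detect truncation. The sketch below assumes one header row, a last line of the form ROWS,<n>, and a count that excludes both header and footer; `check_row_count` is a made-up name.

```python
import re


def check_row_count(lines, sep=","):
    """Return True if the ROWS trailer matches the number of data records.

    Assumptions for this sketch: the first line is a header, the last line
    is 'ROWS<sep><n>', and <n> counts data records only.
    """
    rows = [ln.rstrip("\r\n") for ln in lines]
    if len(rows) < 2:
        return False
    tag, _, count = rows[-1].partition(sep)
    if tag != "ROWS" or re.fullmatch(r"[0-9]+", count) is None:
        return False  # trailer missing or malformed: file is probably cut short
    return int(count) == len(rows) - 2
```

If the file was cut off partway through, the last line is a data record rather than the trailer, so the check fails.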
- Creator-Reserved Fields
- Each data record in the file should contain a field that is used strictly by the _creator_ of the file. Any processors should be designed to ignore, but preserve, this field. This can come in very useful when you are trying to reconcile two feeds. So, as an example, if you are providing a data delivery service from multiple inbound feeds, you should provide a way for those feeds to "tag" each record in a way that suits them. When the time comes to tell why the data being delivered to the end user doesn't match the data provided to the aggregator, these fields become critical for reconciliation.
- Data records should be easily identified
- Each record type in the file should be easily identified from the others. Using the first field as an indicator works well.
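With the record type in the first field, a reader can route lines without knowing anything else about them. A small sketch, where the tags (ADR, USR) and the name `split_by_type` are invented for the example:

```python
def split_by_type(lines, sep=","):
    """Group records by the tag found in their first field."""
    groups = {}
    for line in lines:
        tag, _, rest = line.rstrip("\r\n").partition(sep)
        groups.setdefault(tag, []).append(rest)
    return groups
```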
- Uniform, sortable date stamps.
- Do yourself a favour: drop whatever allegiance you have to your nation's preferred date format and use something standard like an ISO-compliant date stamp (i.e. YYYY-MM-DD HH:MM:SS). They are sortable, scalable, easily read, and utterly unambiguous. ALL of the other date formats (especially ones using two-digit years) suffer from serious problems of ambiguity. YOU may know that the dates in the file are MM/DD/YY, but the German intern who is trying to parse your output very likely will not. While some would say that the specification is for avoiding these problems, I don't agree. Eventually you will get bitten anyway. With an ISO datestamp it's unlikely that you ever will.
- Specify separators explicitly.
- Do not say "all files are CSV files" if you mean "all files are MS-Excel compatible CSV format". Do not say "file is in CSV format". Say "fields in the file will be separated by commas" and specify the hex ASCII code for the separator. Consider using other separators as suits your data. Tab-separated values have some advantages, for instance.
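Two quick illustrations of the date stamp point, as a sketch with made-up sample values: sorting the ISO-style stamps as plain text gives chronological order, while a string like 03/04/05 parses happily as two different days depending on which convention the reader assumes.

```python
from datetime import datetime

ISO_FORMAT = "%Y-%m-%d %H:%M:%S"

stamps = ["2024-03-01 08:00:00", "2023-12-31 23:59:59", "2024-01-15 12:30:00"]

# A plain string sort and a real chronological sort agree for this format.
by_text = sorted(stamps)
by_time = sorted(stamps, key=lambda s: datetime.strptime(s, ISO_FORMAT))

# The same text read under two national conventions yields two dates.
ambiguous = "03/04/05"
as_month_first = datetime.strptime(ambiguous, "%m/%d/%y")
as_day_first = datetime.strptime(ambiguous, "%d/%m/%y")
```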
- Specify line endings explicitly
- Do not assume the reader will be using the line endings you are used to. Explicitly specify them. I prefer to use network line endings, "CR-LF" (hex 0D 0A).
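On the writing side, both the separator and the line ending can be pinned down in code rather than left to platform defaults. This sketch uses Python's csv module with a tab separator and CR-LF endings; the constant names and sample records are mine.

```python
import csv
import io

FIELD_SEPARATOR = "\x09"  # horizontal tab, ASCII hex 09
LINE_ENDING = "\x0d\x0a"  # CR LF, ASCII hex 0D 0A


def write_records(records, stream):
    """Write rows using the separator and line ending the spec names."""
    writer = csv.writer(
        stream,
        delimiter=FIELD_SEPARATOR,
        lineterminator=LINE_ENDING,
        quoting=csv.QUOTE_MINIMAL,
    )
    writer.writerows(records)


buffer = io.StringIO(newline="")
write_records([["ADR", "10.0.0.1"], ["ADR", "10.0.0.2"]], buffer)
```

When writing to a real file, open it with `newline=""` so the platform does not translate the line endings a second time.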
- Put a document version number and date on the document
- A lot of people leave this off, and it usually causes problems when they do.
- Put your name on the document.
- Be proud of it. Don't take your name off because that is company policy without a serious discussion of the policy. People should know who to talk to for clarification or changes.
- File naming convention
- The specification should include information about how the file should be named. It should be possible to reproduce the filename from the header. (Although not necessarily the opposite.)
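As a sketch of what "reproduce the filename from the header" could look like, suppose the convention is creator, compact timestamp and data version joined by underscores. That convention, the .dat extension and the function name are purely hypothetical.

```python
from datetime import datetime


def filename_from_header(creator, created, data_version):
    """Rebuild a file's name from header fields.

    Hypothetical convention: <creator>_<YYYYMMDDTHHMMSS>_v<version>.dat
    """
    stamp = datetime.strptime(created, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%dT%H%M%S")
    return f"{creator}_{stamp}_v{data_version}.dat"
```

If somebody renames a file to final_FINAL2.dat, running this over its header gets the proper name back.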
- Specify data types
- Each field in the file should have its type described in an easily read way. People from business types to programmers will be reading it, so try to provide sufficient data that all of them can understand what is going on. Providing examples of field contents is a good idea, but can be tricky, as making a mistake in the sample can really confuse things.
- Specify numeric types carefully.
- This includes specifying if thousands separators are to be used or if the decimal point will be a '.' or some other char. Consider using a regular expression in the documentation to denote specifically what numbers should look like. Do not assume that the company you are dealing with will necessarily have the same numeric formats as you do.
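Here is the kind of regular expression I mean, as a sketch for one assumed field type: an amount with an optional leading minus, no thousands separators, a '.' decimal point and exactly two decimal places. The pattern and the `parse_amount` helper are examples, not a recommendation for your data.

```python
import re
from decimal import Decimal

# Assumed rule for this sketch: optional minus, no leading zeros,
# no thousands separators, '.' as decimal point, exactly two decimals.
AMOUNT = re.compile(r"-?(0|[1-9][0-9]*)\.[0-9]{2}")


def parse_amount(text):
    """Return the amount as a Decimal, rejecting anything off-spec."""
    if AMOUNT.fullmatch(text) is None:
        raise ValueError(f"not a valid amount: {text!r}")
    return Decimal(text)
```

Putting the same pattern in the spec and in the reader means values like 1,234.50 or 1234,50 get rejected instead of silently misread.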
- If speed is an issue consider using fixed width fields.
- Using pack and unpack, this can be an extremely efficient way to read the records, and has the somewhat useful property that you can tell how many records are in a file by its size.
No doubt I've missed stuff out here, or have stuff in here you disagree with. Please let me know what. :-)
Update: Added a few things that occurred to me that others haven't mentioned yet.
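A rough sketch of the fixed-width idea using Python's struct module, which offers its own pack and unpack. The record layout (a 3-byte tag, a 15-byte space-padded address and a 4-byte unsigned counter) is invented for the example, and the size calculation assumes the file holds nothing but such records.

```python
import struct

# Hypothetical layout: 3-byte tag, 15-byte padded text, 4-byte unsigned int.
# "<" means little-endian with no alignment padding, so the size is 22 bytes.
RECORD = struct.Struct("<3s15sI")


def pack_record(tag, address, hits):
    """Encode one record into exactly RECORD.size bytes."""
    return RECORD.pack(tag.encode("ascii"), address.encode("ascii").ljust(15, b" "), hits)


def unpack_records(blob):
    """Yield (tag, address, hits) tuples from a block of packed records."""
    for tag, address, hits in RECORD.iter_unpack(blob):
        yield tag.decode("ascii"), address.decode("ascii").rstrip(), hits


def record_count(size_in_bytes):
    """Work out the number of records from the byte size alone."""
    count, leftover = divmod(size_in_bytes, RECORD.size)
    if leftover:
        raise ValueError("size is not a whole number of records")
    return count
```

A size that is not a multiple of the record length is also a cheap corruption check.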