http://www.perlmonks.org?node_id=686498

ftumsh has asked for the wisdom of the Perl Monks concerning the following question:

Lo,

I'm trying to identify various types of text file: XML, CSV etc.

The idea is that my module is presented with a text file and works out what type it is.

The one file format I am having trouble with is fixed width. The definition of a fixed width file being:

1) Text file made up of records (ie LF or CRLF delimited)
2) Different records may be of different lengths
3) Records of a particular type may be denoted by starting with particular characters or by the length of the record.

As you know, variants of the above are legion, so I only expect (hope) to catch a largish percentage.

The only test I have at the moment is whether every record has the same length, and it only runs once the file has failed the tests for the other file types, ie I'm testing for fixed width after all else.
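A minimal sketch of that same-length test, in Python purely for illustration. The function names `split_records` and `all_same_length` are made up for this example, and it assumes the whole file has already been read into a string.

```python
def split_records(text):
    """Split text into records on LF or CRLF, dropping one trailing empty record."""
    records = text.replace("\r\n", "\n").split("\n")
    if records and records[-1] == "":
        records.pop()  # the file ended with a newline
    return records


def all_same_length(text):
    """True if there is at least one record and every record has the same length."""
    records = split_records(text)
    return len(records) > 0 and len({len(r) for r in records}) == 1
```

This only covers the simplest case. Files whose record types differ in length will fail it.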

Typically in a simple case a file will contain a header record followed by line records. This will repeat down the file. eg

Hfoobar
L123456field2
L...
H...
L...
L...
etc
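One way the simple header/line case might be tested is to group records by their leading characters and check that each group has a single length. The sketch below is in Python for illustration only. The names `lengths_by_prefix` and `looks_tagged_fixed_width`, and the defaults `prefix_len=1` and `max_types=8`, are assumptions, not anything taken from a real format definition.

```python
from collections import defaultdict


def lengths_by_prefix(records, prefix_len=1):
    """Map each record prefix to the set of record lengths seen with it."""
    groups = defaultdict(set)
    for r in records:
        if r:
            groups[r[:prefix_len]].add(len(r))
    return dict(groups)


def looks_tagged_fixed_width(records, prefix_len=1, max_types=8):
    """Guess whether records are typed by prefix, one fixed length per type."""
    non_empty = [r for r in records if r]
    groups = lengths_by_prefix(non_empty, prefix_len)
    if not groups or len(groups) > max_types:
        return False  # nothing to judge, or too many distinct "types"
    if len(non_empty) <= len(groups):
        return False  # no prefix repeats, so there is no evidence either way
    return all(len(lengths) == 1 for lengths in groups.values())
```

Short files and files with optional trailing fields can still fool this, so it is better treated as one signal among several than as a verdict.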

In a more complicated file, the header and line will be split across multiple records eg

Hfield1field2
Ffield1field2 part of header still
Afield3 field4 still part of header

Now I can look at a file by eye and say yes it's fixed width, so it should be possible to do so programmatically.

The options I have so far:

1) Try and work out if it's fixed width
2) Say hey, we got this far so it's fixed width (will give false positives on random text files)
3) Work out if it's a text file containing prose; if it's not, it's fixed width
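For option 3, a crude prose detector could score each line by how many of its whitespace-separated tokens are plain words. This is a rough sketch in Python, for illustration only. The names `looks_like_prose_line` and `prose_share` and the thresholds `min_words=3` and `min_alpha_share=0.8` are guesses that would need tuning against real files.

```python
import string


def looks_like_prose_line(line, min_words=3, min_alpha_share=0.8):
    """True if the line has enough tokens and most of them are purely alphabetic."""
    tokens = [t.strip(string.punctuation) for t in line.split()]
    tokens = [t for t in tokens if t]
    if len(tokens) < min_words:
        return False
    alpha = sum(1 for t in tokens if t.isalpha())
    return alpha / len(tokens) >= min_alpha_share


def prose_share(records):
    """Fraction of non-blank records that look like prose (0.0 if there are none)."""
    non_empty = [r for r in records if r.strip()]
    if not non_empty:
        return 0.0
    return sum(looks_like_prose_line(r) for r in non_empty) / len(non_empty)
```

A file would then be called prose when `prose_share` exceeds some cutoff. Fixed width records with space-padded name or address fields may score as prose, so the cutoff is a trade-off rather than a clean boundary.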

The text files my module will be presented with should be computer generated, so prose text is a mistake and should not happen too often. The whole point of this is to cut out humans having to identify a file. In other words, I don't expect it to catch every fixed width file.

So, all and any suggestions gratefully received.

John