Re: Getting File Type using Regular Expressions

in reply to Getting File Type using Regular Expressions

Of course, Windows applications make a lot of assumptions about files based on their filenames (especially their extensions). But web servers or end users may not be running the same operating system. You may want to check file type by investigating the contents, not the filename. Even on Windows, is '.doc' always a Microsoft Word file?

The first two to four bytes of a file are often a very good clue as to the file's type. These bytes are usually referred to as a "magic number." For example, Windows .bmp files start with the two bytes "BM". JPEG files start with the bytes 0xFF 0xD8 0xFF (followed by 0xE0 in the common JFIF variant). Unix scripts often start with a shebang: "#!".
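As a rough illustration of the magic-number idea, here is a minimal sketch in Python (the same logic ports directly to Perl with read() and substr()). The table and the function names sniff_magic and sniff_file are made up for this example, and the table lists only the three signatures mentioned above, so treat it as a starting point rather than a complete detector.

```python
# Hypothetical signature table: only the examples named in the text.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"BM", "Windows bitmap"),
    (b"#!", "script with shebang"),
]


def sniff_magic(head):
    """Return a label for the leading bytes, or None if nothing matches."""
    for signature, label in MAGIC_NUMBERS:
        if head.startswith(signature):
            return label
    return None


def sniff_file(path):
    """Read only the first four bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return sniff_magic(fh.read(4))
```

A two-byte signature such as "BM" will produce false positives on arbitrary data, which is why tools built on this idea usually check further bytes before committing to an answer.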

Testing text files is a little trickier, but there are three basic tests: (1) If the first 128 or 256 bytes are all plain ASCII, then the odds are that's what the whole file is. (2) If the first 1024 bytes are well-formed UTF-8, that's probably what the whole file is. (3) Other text encodings have to be guessed from the overall distribution of bytes. Single-byte German encodings will use certain non-ASCII bytes more often, while avoiding some bytes that are common in single-byte Cyrillic encodings.
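Tests (1) and (2) can be sketched in a few lines of Python. The function names here are hypothetical, "plain ASCII" is taken to mean every byte is below 0x80, and the window sizes are the ones suggested above. Test (3) needs frequency tables per encoding and is left out. The UTF-8 check uses an incremental decoder so that a multi-byte character cut off at the end of the sample is not counted as an error.

```python
import codecs


def looks_like_ascii(head):
    """True if every byte in the sample is 7-bit (below 0x80)."""
    return all(byte < 0x80 for byte in head)


def looks_like_utf8(head):
    """True if the sample is well-formed UTF-8, tolerating a truncated tail."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: an incomplete sequence at the end is buffered, not rejected.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True


def guess_text_encoding(data):
    """Return "ascii", "utf-8", or None when neither test passes."""
    if looks_like_ascii(data[:256]):
        return "ascii"
    if looks_like_utf8(data[:1024]):
        return "utf-8"
    return None
```

Because only a prefix is examined, a file that is ASCII for its first 256 bytes and UTF-8 afterwards is reported as "ascii". That is the trade-off this heuristic accepts in exchange for reading very little of the file.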

On Unix, a tool called 'file' has a large and growing database of file type heuristics. File::Type is a roughly equivalent Perl module. Both read just enough of a file to make a solid guess as to its type, and report it.

--
[ e d @ h a l l e y . c c ]
