Beefy Boxes and Bandwidth Generously Provided by pair Networks
laziness, impatience, and hubris
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Of course, Windows applications make a lot of assumptions about files based on their filenames (especially their extensions). But web servers or end users may not be running the same operating system. You may want to check file type by investigating the contents, not the filename. Even on Windows, is '.doc' always a Microsoft Word file?

The first two or four bytes of most files are often a very good clue as to the file's type. These bytes are usually referred as a "magic number." For example, the first two bytes of "BM" are common in Windows .bmp files. JPEG files start with 0xFF 0xD8 0xFF 0xE0 bytes. Unix scripts often start with a shebang: "#!".

Testing text files is a little more tricky, but there are three basic tests: (1) If all bytes in the first 128 or 256 bytes are just pure plain ASCII, then the odds are that's what the whole file is. (2) If all bytes in the first 1024 bytes are well-formed UTF-8, that's probably what the whole file is. (3) Other text encodings should be guessed by the overall distribution of characters. Single-byte German encodings will use certain non-ASCII bytes more often, while avoiding some bytes used in single-byte Cyrillic encodings.

On Unix, a tool called 'file' has a large and growing database of file type heuristics. File::Type is a Perl module equivalent. These read just enough of a file to make a solid guess as to the type, and report it.

--
[ e d @ h a l l e y . c c ]


In reply to Re: Getting File Type using Regular Expressions by halley
in thread Getting File Type using Regular Expressions by bkiahg

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others making s'mores by the fire in the courtyard of the Monastery: (4)
As of 2024-04-19 20:37 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found