Of course, Windows applications make a lot of assumptions about files based on their filenames (especially their extensions). But web servers and end users may not be running the same operating system, so you may want to check a file's type by inspecting its contents rather than its name. Even on Windows, is '.doc' always a Microsoft Word file?

The first two to four bytes of a file are often a very good clue as to its type. These bytes are usually referred to as a "magic number." For example, Windows .bmp files begin with the two bytes "BM". JPEG files start with the bytes 0xFF 0xD8 0xFF; in the common JFIF variant the fourth byte is 0xE0. Unix scripts often start with a shebang: "#!".
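
As a rough illustration of the magic-number check, here is a minimal sketch in Python (the same logic ports to Perl with open, binmode and read). The table holds only the three signatures mentioned above, and the names MAGIC_NUMBERS, sniff_magic and sniff_file are made up for this example.

```python
# Illustrative signature table: only the three examples from the post.
# Real tools such as 'file' know many hundreds of these.
MAGIC_NUMBERS = [
    (b"\xff\xd8\xff", "JPEG image"),
    (b"BM", "Windows bitmap"),
    (b"#!", "script with shebang"),
]


def sniff_magic(data):
    """Return a type guess from the leading bytes, or None if nothing matches."""
    for signature, description in MAGIC_NUMBERS:
        if data.startswith(signature):
            return description
    return None


def sniff_file(path, nbytes=4):
    """Read only the first few bytes of a file and guess its type."""
    with open(path, "rb") as handle:  # binary mode: no newline translation
        return sniff_magic(handle.read(nbytes))
```

A two-byte signature like "BM" will also match any text file that happens to start with those letters, so a match is a strong hint rather than proof.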

Testing text files is a little trickier, but there are three basic tests: (1) If the first 128 or 256 bytes are all plain ASCII, then the odds are that's what the whole file is. (2) If the first 1024 bytes are well-formed UTF-8, that's probably what the whole file is. (3) Other text encodings have to be guessed from the overall distribution of byte values. Text in a single-byte German encoding will use certain non-ASCII bytes often, while rarely using bytes that are common in single-byte Cyrillic encodings.
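
Tests (1) and (2) can be sketched as below; test (3) needs per-language frequency tables and is not attempted here. This is a minimal sketch in Python, and the function names are invented for the example. It also assumes that "plain ASCII" means printable characters plus tab, LF and CR, which is one choice among several.

```python
import codecs


def looks_like_ascii(data, sample_size=256):
    """Test (1): every sampled byte is printable ASCII, tab, LF or CR."""
    sample = data[:sample_size]
    return all(byte in (9, 10, 13) or 32 <= byte < 127 for byte in sample)


def looks_like_utf8(data, sample_size=1024):
    """Test (2): the sampled bytes decode as well-formed UTF-8."""
    sample = data[:sample_size]
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # If the sample was cut short, a multi-byte character split by the
        # cut is tolerated (final=False); otherwise it counts as malformed.
        decoder.decode(sample, final=len(data) <= sample_size)
    except UnicodeDecodeError:
        return False
    return True


def guess_text_kind(data):
    """Return 'ascii', 'utf-8', or None when neither test passes."""
    if looks_like_ascii(data):
        return "ascii"
    if looks_like_utf8(data):
        return "utf-8"
    return None
```

The ASCII test runs first because any ASCII text is also valid UTF-8; checking in the other order would never report plain ASCII.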

On Unix, a tool called 'file' has a large and growing database of file type heuristics. File::Type is a roughly equivalent Perl module. Both read just enough of a file to make a solid guess as to its type, and report it.
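
If the 'file' tool is installed, a script can simply ask it. The following is a minimal sketch in Python that shells out to it; the wrapper name file_type is made up, and the sketch assumes a 'file' build that accepts the --brief and --mime-type options, as current Unix versions do. It returns None when the tool is not on the PATH.

```python
import shutil
import subprocess


def file_type(path):
    """Ask the Unix 'file' tool for a MIME type; None if it is unavailable."""
    if shutil.which("file") is None:
        return None  # tool not installed (e.g. a stock Windows box)
    result = subprocess.run(
        # "--" stops option parsing, so a path starting with "-" is safe
        ["file", "--brief", "--mime-type", "--", path],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

The exact strings reported vary between versions of 'file' and its database, so compare the result loosely instead of against one fixed list.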

[ e d @ h a l l e y . c c ]

In reply to Re: Getting File Type using Regular Expressions by halley
in thread Getting File Type using Regular Expressions by bkiahg
