PerlMonks  

Getting File Type using Regular Expressions

by bkiahg (Pilgrim)
on Apr 21, 2004 at 13:04 UTC (#346987=perlquestion)

bkiahg has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,

I am trying to add a check on the type of a file that I am uploading. It is on a Windows-based system, and I need to see whether it's an image file (.jpg, .gif, .bmp, etc.) or a text file (.doc, .txt, etc.). I am extremely new to regular expressions.

Thanks in advance!

Replies are listed 'Best First'.
Re: Getting File Type using Regular Expressions
by arden (Curate) on Apr 21, 2004 at 13:10 UTC
    So what have you tried so far? This is a topic covered very extensively and thoroughly in nearly every Perl book I've ever seen!

    Just to be helpful, try something like  $filename =~ /\.(jpg|gif|bmp)$/ for image files. I'll assume you can figure out the text file version from that. . .

    - - arden.

    Update: For more information, see Perl in a Nutshell Chapter 4 section 6, Programming Perl Chapter 1 section 7, Learning Perl Chapter 7 section 1, Perl Cookbook Chapter 6, or even the Tutorials here at PerlMonks!

Re: Getting File Type using Regular Expressions
by halley (Prior) on Apr 21, 2004 at 13:37 UTC
    Of course, Windows applications make a lot of assumptions about files based on their filenames (especially their extensions). But web servers or end users may not be running the same operating system. You may want to check file type by investigating the contents, not the filename. Even on Windows, is '.doc' always a Microsoft Word file?

    The first two or four bytes of most files are often a very good clue as to the file's type. These bytes are usually referred to as a "magic number." For example, Windows .bmp files commonly start with the two bytes "BM". JPEG files start with the bytes 0xFF 0xD8 0xFF (in JFIF files the fourth byte is 0xE0). Unix scripts often start with a shebang: "#!".
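    A minimal sketch of that kind of magic-number check, assuming a tiny hand-written signature table. The table, the type labels, and the names sniff_bytes and sniff_file are hypothetical, and the GIF entry is an addition not mentioned above; a real tool knows far more formats.

```perl
use strict;
use warnings;

# Hypothetical signature table: the formats named above, plus GIF.
my @SIGNATURES = (
    [ 'image/jpeg', "\xFF\xD8\xFF" ],
    [ 'image/bmp',  'BM'           ],
    [ 'image/gif',  'GIF8'         ],
    [ 'script',     '#!'           ],
);

# Return a type label for a string of leading bytes, or undef if none match.
sub sniff_bytes {
    my ($head) = @_;
    for my $sig (@SIGNATURES) {
        my ($type, $magic) = @$sig;
        return $type if index($head, $magic) == 0;
    }
    return;
}

# Read the first few bytes of a file in binary mode and sniff them.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $head = '';
    read $fh, $head, 8;
    close $fh;
    return sniff_bytes($head);
}
```

    Opening with the :raw layer matters on Windows, where the default text mode would translate line endings and could alter the bytes being compared.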

    Testing text files is a little more tricky, but there are three basic tests: (1) If all bytes in the first 128 or 256 bytes are just pure plain ASCII, then the odds are that's what the whole file is. (2) If all bytes in the first 1024 bytes are well-formed UTF-8, that's probably what the whole file is. (3) Other text encodings should be guessed by the overall distribution of characters. Single-byte German encodings will use certain non-ASCII bytes more often, while avoiding some bytes used in single-byte Cyrillic encodings.
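    Tests (1) and (2) might be sketched as follows, using the core Encode module. The function names and the exact set of "plain" bytes (tab, newline, carriage return, and printable ASCII) are assumptions of this sketch; test (3), statistical guessing of other encodings, is not attempted here.

```perl
use strict;
use warnings;
use Encode ();

# Test (1): the first 256 bytes are all tab, LF, CR, or printable ASCII.
sub looks_ascii {
    my ($bytes) = @_;
    my $sample = substr($bytes, 0, 256);
    return $sample =~ /\A[\x09\x0A\x0D\x20-\x7E]*\z/ ? 1 : 0;
}

# Test (2): the first 1024 bytes are well-formed UTF-8.
sub looks_utf8 {
    my ($bytes) = @_;
    my $sample = substr($bytes, 0, 1024);

    # If the sample was cut short, drop a possibly incomplete trailing
    # multi-byte sequence so truncation is not mistaken for bad input.
    if (length($bytes) > 1024) {
        $sample =~ s/[\xC0-\xFF][\x80-\xBF]{0,2}\z//;
    }

    # Strict 'UTF-8' decoding dies on malformed input when FB_CROAK is set.
    my $ok = eval { Encode::decode('UTF-8', $sample, Encode::FB_CROAK()); 1 };
    return $ok ? 1 : 0;
}
```

    Both functions expect raw bytes, for example data read from a filehandle opened with the :raw layer. Note that pure ASCII input also passes the UTF-8 test, so run the ASCII test first if the distinction matters.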

    On Unix, a tool called 'file' has a large and growing database of file type heuristics. File::Type is a Perl module equivalent. These read just enough of a file to make a solid guess as to the type, and report it.

    --
    [ e d @ h a l l e y . c c ]

      On Unix, a tool called 'file' has a large and growing database of file type heuristics. File::Type is a Perl module equivalent. These read just enough of a file to make a solid guess as to the type, and report it.
      I feel that the File::MMagic module is a more mature alternative to File::Type for providing file-like functionality. It has the added advantage that it can read the same magic files that file uses, which extends the set of types it recognises.

       

      perl -le "print unpack'N', pack'B32', '00000000000000000000001011010010'"

      Of course, Windows applications make a lot of assumptions about files based on their filenames (especially their extensions).
      Yep, that's pretty dumb.
      You may want to check file type by investigating the contents, not the filename.
      That makes some sense. But not overly because it's pretty hard to do.
      The first two or four bytes of most files are often a very good clue as to the file's type. These bytes are usually referred as a "magic number."
      <rant>
      Well, guessing the type of the content of a file based on the first two bytes (or rather, the first couple of bytes, /etc/magic allows for variable formats) is not much smarter than using the file name. Sure, you are free to choose your filename - but who takes advantage of that? No one stores gif images in files ending in .pl, and if you put your C program in a file called "fuddly-bumps.html", chances are your compiler will not take you seriously and refuse to compile your program. Not to mention that the classical Unix build program, make, depends entirely on filenames to build the targets. And yes, another advantage is that you don't need a filename to make a guess. But the disadvantage is that the magic number gets in the way. Not much of a problem for binary formats which are purely processed by programs. But annoying, and prone to error for anything edited by humans. Furthermore, it still is uncontrolled guesswork (just like file extensions). Anyone can invent a magic number, whether it's in use or not, there's no official way of keeping track, making sure there are no collisions etc. Here's a small example of the dumbness of magic numbers:
      The Netpbm project uses several (related) file formats. The magic numbers are "P1", "P2", "P3", "P4", "P5" and "P6". Looks simple. Looks extensible as well, doesn't it? If more formats are needed, just continue the numbering. "P7", "P8", "P9", "P10". Right? No. If you start a file with a P followed by a 1, regardless of what follows, file thinks it's a "Netpbm PBM image text". Even if it's a simple text file that starts with the sentence "P100s of Samsung are really cool phones".
      My point is that magic numbers suck as bad as file extensions. Both magic numbers and file extensions work in practice reasonably well because people follow de facto standards. Windows uses file extensions almost exclusively. Unix (and with that, I mostly mean Unix tools) relies on both. Some tools use magic numbers. Some tools use file extensions. Some use both.
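      The collision can be reproduced with a two-byte prefix test. This is only an illustration of the idea, not how file itself is implemented. The second function is a hypothetical, slightly stricter variant that relies on the Netpbm rule that whitespace follows the magic number; it is still a heuristic.

```perl
use strict;
use warnings;

# Naive check: "P" followed by a digit 1-6, nothing else examined.
sub naive_is_netpbm {
    my ($head) = @_;
    return $head =~ /\AP[1-6]/ ? 1 : 0;
}

# Stricter check: also require whitespace right after the magic number.
sub stricter_is_netpbm {
    my ($head) = @_;
    return $head =~ /\AP[1-6]\s/ ? 1 : 0;
}

my $sentence = "P100s of Samsung are really cool phones";
print "naive:    ", naive_is_netpbm($sentence),    "\n";
print "stricter: ", stricter_is_netpbm($sentence), "\n";
```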
      </rant>

      Abigail

        All we have in any situation is context and convention. Intuition won't solve everything, and computational completeness won't solve everything. Perhaps the byte sequence "P100s of Samsung are really cool phones" is a perfectly well-formed 6x6 pixel GIF file. You can only guess at the intent, and more data gives a better guess. That's why they call them 'heuristics.'

        That said, malicious users will attack any such heuristic assumptions to their favor: think Britney.jpg.exe. If your upload code expects web-intended images and only wants to accept web-intended images, it benefits the system to require that every available heuristic passes muster. If it's not .jpg, toss it. If it's not JPEG magic, toss it. If the ImageMagick tool says the pixel dimensions are over 10000 in either dimension, toss it.
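        A rough sketch of the first two of those checks combined. The function name accept_jpeg and its two-argument interface are assumptions of this sketch, and the pixel-dimension check is left out because it needs an image library.

```perl
use strict;
use warnings;

# Accept an upload as JPEG only if the name and the leading bytes agree.
# $name is the client-supplied filename, $head the first bytes of content.
sub accept_jpeg {
    my ($name, $head) = @_;

    # The extension must be the very last thing in the name.
    return 0 unless $name =~ /\.jpe?g\z/i;

    # The content must begin with the JPEG magic bytes.
    return 0 unless index($head, "\xFF\xD8\xFF") == 0;

    return 1;
}
```

        Anchoring the pattern with \z is what rejects a double extension such as Britney.jpg.exe.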

        --
        [ e d @ h a l l e y . c c ]

        It is funny, out of that entire post I did not see a valid recommendation at all. All I saw was bashing of others' recommendations.

        An easy (but not foolproof) way of distinguishing between a text file and a binary file is to use the ready-made -B file test operator, or -T (depending on which way your flag flies).

        print "File is binary\n" if (-B);
        print "File is a text file\n" if (-T);
        or if you are a "NOT" guy/girl the following might suit your fancy.
        print "File is binary\n" if (!-T);
        print "File is a text file\n" if (!-B);
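        With no operand, -T and -B test the file named in $_. A self-contained sketch with explicit paths, using temporary files whose contents are made up for the demonstration (the helper make_temp is hypothetical):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $content to a temporary file and return its path.
sub make_temp {
    my ($content) = @_;
    my ($fh, $path) = tempfile(UNLINK => 1);
    binmode $fh;
    print {$fh} $content;
    close $fh or die "close failed: $!";
    return $path;
}

my $text_path   = make_temp("Just some ordinary text.\n" x 20);
my $binary_path = make_temp("\x00\x01\x02\xFF" x 100);

for my $path ($text_path, $binary_path) {
    my $verdict = (-T $path) ? "looks like text" : "looks binary";
    print "$path: $verdict\n";
}
```

        Both operators only sample the start of the file, and an empty file passes both tests, so the two are not exact opposites.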
        Hope that helps some.
      Thank you halley I will use both.
Re: Getting File Type using Regular Expressions
by kvale (Monsignor) on Apr 21, 2004 at 13:11 UTC
    Welcome to the wonderful world of regexes! To learn more about regular expressions, check out my perlrequick tutorial.

    With reference to your particular question, alternation will help here:

    if ($filename =~ /\.(jpg|gif|bmp|doc|txt)$/) { print "Found a match\n"; }
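    Since the original question wanted to tell images from text files, the same alternation can be split into two patterns. This is only a sketch: the function name, the category labels, and the extension lists are assumptions. The /i flag is added because Windows filenames often arrive in upper case.

```perl
use strict;
use warnings;

# Classify a filename by its extension alone (no content inspection).
sub classify_extension {
    my ($filename) = @_;
    return 'image' if $filename =~ /\.(?:jpe?g|gif|bmp)\z/i;
    return 'text'  if $filename =~ /\.(?:doc|txt)\z/i;
    return 'unknown';
}

print classify_extension($_), "\n" for qw(PHOTO.JPG notes.txt archive.zip);
```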

    -Mark

      Thanks for the online tutorials guys. Will definitely read up on them.
Re: Getting File Type using Regular Expressions
by kiat (Vicar) on Apr 21, 2004 at 15:04 UTC
    Maybe you could try this (you said 'uploading', so I'm assuming it may be web-based):

    use CGI;
    my $q    = CGI->new;
    my $file = $q->param('myfile');
    my $type = $q->uploadInfo($file)->{'Content-Type'};
    print "$type\n";
      Yes, it is a CGI app. What is the difference between $q->uploadInfo($file)->{'Content-Type'}; and using File::Type or File::MMagic? Do they check the same thing, or what criteria do they base their assumptions on?

        The code:

        $type = $query->uploadInfo($filename)->{'Content-Type'};
        returns the Content-Type header the browser added to the upload. Basically, it's what the browser (or the user's computer) thinks this file is. It's not fool-proof however, and browsers are not required to include it.

        See the CGI documentation for more information.
