PerlMonks  

determine file type from data read from filehandle

by expo1967 (Novice)
on Aug 07, 2018 at 16:02 UTC
expo1967 has asked for the wisdom of the Perl Monks concerning the following question:

At the office I am modifying a Perl CGI script that processes data read from a file the user selects through an HTML file field. Given the filehandle of the opened file in the Perl CGI script, is there some way to determine whether the data is text or binary? The old version of the script only dealt with CSV data. Now I need to modify the script so it can also handle Excel spreadsheets. I already have code that can read spreadsheets. Since the file selected by the user may be a CSV file without a .csv extension, I can't simply look at the filename. Any ideas on how to determine the file type by testing the data read from the open filehandle? Thanks.

Replies are listed 'Best First'.
Re: determine file type from data read from filehandle
by choroba (Bishop) on Aug 07, 2018 at 16:27 UTC
    XLSX files are in fact zip files, so you can try to unzip them to test whether they are XLSX or not. The older XLS files are memory dumps of MSExcel, but they should still follow a pattern. Maybe try MIME::Detect or one of the other libraries mentioned in its SEE ALSO section.
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
Re: determine file type from data read from filehandle
by Your Mother (Bishop) on Aug 07, 2018 at 16:24 UTC

    You should never trust the file extension even if it's there. You can try File::Type or File::MMagic. I would probably just do something like (first-draft logic, untested)–

    if ( "ok" eq eval { process_upload_as_excel($upload); "ok" } ) {
        # Do stuff, report success.
    }
    elsif ( $@ ) {
        my $first_err = $@;
        if ( "ok" eq eval { process_upload_as_csv($upload); "ok" } ) {
            # Do stuff, report success.
        }
        else {
            die "Couldn't process as Excel or CSV...";
            # Include $first_err and (new one) $@. Or don't.
            # I'm a code comment, not a cop.
        }
    }
    else {
        die "Unknown failure!";
    }

      While I agree that CSV files do come with all kinds of extensions, including none or even .xls, Excel comes from a world where file extension matters. So I wouldn't process an Excel file with a .txt extension, even if it worked, because if I received such a file (or an .xlsx file that does not contain Excel data), I'd think there's something wrong happening with the input data.

        I definitely see the point but for it to matter in fact you have to come up with a .txt or any other file type that will pass Excel parsing without error and return a workbook with content. That seems, to me, like a concern that can be safely punted to: "Failed to import foo.bar: because reasons..."

Re: determine file type from data read from filehandle
by bliako (Chaplain) on Aug 07, 2018 at 22:52 UTC

    The 3 file-detect modules mentioned already: File::Type, MIME::Detect, File::MMagic all report an xlsx file as application/zip because it really is a zip file as choroba wrote.

    The catch is that even if your uploaded file is a zip file, it is not necessarily an xlsx. So you will probably have to extract the list of files from the archive and check that certain files every xlsx document should contain are present. I am not sure there is a 100% accurate way to do that detection, because of possible exceptions, unless you feed the file to M$. In fact I do not know whether the xlsx file format is officially public knowledge or whether one has to reverse engineer it.
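
The "list the archive members and look for the expected ones" idea could be sketched roughly as below, using the core module IO::Uncompress::Unzip. The helper name `looks_like_xlsx` is made up for illustration, and the two member names it checks (`[Content_Types].xml` and `xl/workbook.xml`) are an assumption about what a typical workbook contains, not a guarantee from any parser. Treat it as a heuristic, not a validator.

```perl
use strict;
use warnings;
use IO::Uncompress::Unzip qw($UnzipError);

# Hypothetical helper: returns 1 if $input (a filename, filehandle or
# scalar ref, whatever IO::Uncompress::Unzip->new accepts) is a zip archive
# holding the member names we ASSUME mark an XLSX workbook, else 0.
sub looks_like_xlsx {
    my ($input) = @_;

    # Transparent => 0 makes new() fail on input that is not zip data.
    my $unzip = IO::Uncompress::Unzip->new( $input, Transparent => 0 )
        or return 0;

    my %seen;
    my $status;
    for ( $status = 1; $status > 0; $status = $unzip->nextStream ) {
        my $header = $unzip->getHeaderInfo or last;
        $seen{ $header->{Name} } = 1 if defined $header->{Name};

        # Drain this member so the next header can be reached.
        my $chunk;
        1 while ( $status = $unzip->read($chunk) ) > 0;
        last if $status < 0;    # read error
    }
    return 0 if $status < 0;

    # Assumed marker files; adjust to whatever your spreadsheet code needs.
    return ( $seen{'[Content_Types].xml'} && $seen{'xl/workbook.xml'} ) ? 1 : 0;
}
```

Note that this decompresses every member just to walk the archive, so on untrusted uploads you would probably want a size limit in front of it.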

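    If you want to go light and moduleless: the file signature of an xlsx file I created with LibreOffice on a linux/intel box is, according to hexdump, 4b50 0403 0014 0808. hexdump prints 16-bit little-endian words by default, so the bytes on disk are 50 4b 03 04 14 00 08 08, i.e. the zip signature "PK\x03\x04", which is comparable, bar endianness, to the info at https://www.garykessler.net/library/file_sigs.html.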
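
A moduleless signature check on the open filehandle might look like the sketch below. The helper name `sniff_signature` and its return labels are invented for illustration. It assumes the handle is seekable (an uploaded temp file normally is) and it only says what the container looks like: a zip signature is not proof of xlsx, and the OLE2 signature is shared by old .xls, .doc and other files.

```perl
use strict;
use warnings;

# Hypothetical helper: peek at the first 8 bytes of an open, seekable
# filehandle and classify them by well-known magic numbers.
sub sniff_signature {
    my ($fh) = @_;

    # Raw bytes are needed. Note this drops any text layers on the handle;
    # reapply them afterwards if the caller relies on them.
    binmode $fh;

    my $pos = tell $fh;
    my $got = read( $fh, my $head, 8 );
    defined $got or die "read failed: $!";
    seek( $fh, $pos, 0 ) or die "seek failed: $!";    # 0 == SEEK_SET

    return 'zip'  if $head =~ /\APK\x03\x04/;                       # maybe XLSX
    return 'ole2' if $head eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";   # maybe legacy XLS
    return 'unknown';
}
```

Because the handle is rewound to where it was, whatever parser comes next still sees the file from the same position.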

    Then, even detecting a csv file can be tricky if it contains Unicode. You can't even count on the abundance of commas it should normally contain, because they may be encoded in Unicode as fancy counter-clockwise commas :) They do that with inverted commas, and even gcc spits out Unicode nowadays.

Re: determine file type from data read from filehandle
by Eily (Prior) on Aug 07, 2018 at 16:20 UTC

    Please look at Markup in the Monastery to add some formatting to your post.

    To answer your question, I'd still look at the file extension as a first option. If that fails, the -T or -B tests might help, but as documented, they are only guesses.
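
For reference, the -T and -B tests can be applied directly to a filehandle, roughly as in this sketch. The helper name `guess_text_or_binary` is made up, and it assumes a real file behind the handle (such as an upload temp file). Per perlfunc, both tests return true on an empty file, so that case is handled first.

```perl
use strict;
use warnings;

# Hypothetical helper: apply Perl's -T / -B heuristics to an open filehandle.
# These are guesses based on the characters found in the first block or
# current buffer, not a real file type detection.
sub guess_text_or_binary {
    my ($fh) = @_;
    return 'empty'  if -z $fh;    # -T and -B are both true on empty files
    return 'text'   if -T $fh;
    return 'binary' if -B $fh;
    return 'unknown';
}
```

A 'binary' answer would then be the cue to try the spreadsheet code, and 'text' the cue to try the CSV code.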

Node Type: perlquestion [id://1220013]
Approved by Corion
Front-paged by Corion