PerlMonks  

Re: How to recognize Word and XLS files

by jmcnamara (Monsignor)
on Mar 06, 2012 at 09:46 UTC (#958047)


in reply to How to recognize Word and XLS files

Here is a small program that uses OLE::Storage_Lite to distinguish Microsoft doc and xls files.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;

    my @files = (
        'test.xls',
        'test.doc',
        'test.ppt',
        'test.txt',
    );

    for my $filename ( @files ) {
        printf( "%-20s = %s\n", $filename, check_ole_filetype( $filename ) );
    }

    sub check_ole_filetype {

        my $filename = shift;

        # Check that the file exists.
        return 'not_found' if !-e $filename;

        # Create an OLE::Storage_Lite object to read the file.
        my $ole = OLE::Storage_Lite->new( $filename );
        my $pps = $ole->getPpsTree();

        # If getPpsTree() failed then this isn't an OLE file.
        return 'not_ole_file' if !$pps;

        # Loop through the PPS children below the root.
        for my $child_pps ( @{ $pps->{Child} } ) {

            my $pps_name = OLE::Storage_Lite::Ucs2Asc( $child_pps->{Name} );

            # Match an Excel xls file.
            if ( $pps_name eq 'Workbook' || $pps_name eq 'Book' ) {
                return 'xls';
            }

            # Match a Word document.
            if ( $pps_name eq 'WordDocument' ) {
                return 'doc';
            }
        }

        return 'unknown_ole_file';
    }

    __END__

    Output:

    $ perl ole_check.pl
    test.xls             = xls
    test.doc             = doc
    test.ppt             = unknown_ole_file
    test.txt             = not_ole_file

You will probably have to harden it a little for your needs. For example, it is possible that some older Word files have a different $pps_name. A little testing against your own files should show whether that is the case. Also, this won't find Office 2007+ style docx or xlsx files, which are ZIP archives rather than OLE containers.
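
If you also need the Office 2007+ formats, something along these lines might work as a companion check. It is only a sketch and makes a few assumptions: that Archive::Zip is installed from CPAN, and that the files use the conventional part names word/document.xml (docx) and xl/workbook.xml (xlsx). The sub name check_ooxml_filetype and its return strings are made up for this example and are not part of the program above.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use Archive::Zip qw( :ERROR_CODES );

# Hypothetical helper (not from the original post): guess whether a file
# is a docx or xlsx by looking inside the ZIP container.
sub check_ooxml_filetype {

    my $filename = shift;

    # Check that the file exists.
    return 'not_found' if !-e $filename;

    # ZIP archives normally start with the bytes "PK\x03\x04".
    open my $fh, '<:raw', $filename or return 'unreadable';
    my $count = read( $fh, my $magic, 4 );
    close $fh;
    return 'not_zip_file' if !$count || $magic ne "PK\x03\x04";

    # Silence Archive::Zip's default error reporting for damaged archives.
    # Note that this handler is global to the process.
    Archive::Zip::setErrorHandler( sub { } );

    my $zip = Archive::Zip->new();
    return 'not_zip_file' if $zip->read( $filename ) != AZ_OK;

    # Assumption: the main document part lives at its conventional path.
    return 'docx' if $zip->memberNamed( 'word/document.xml' );
    return 'xlsx' if $zip->memberNamed( 'xl/workbook.xml' );

    return 'unknown_zip_file';
}

# Check any files given on the command line.
for my $filename ( @ARGV ) {
    printf( "%-20s = %s\n", $filename, check_ooxml_filetype( $filename ) );
}
```

You could call this when check_ole_filetype() returns 'not_ole_file'. A file that stores its main part under a non-standard name would come back as 'unknown_zip_file', so treat the result as a hint rather than a guarantee.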

--
John.

