PerlMonks  

Re: Spreadsheet::ParseExcel with embedded PDF cells

by jmcnamara (Monsignor)
on Jan 09, 2009 at 16:27 UTC (#735234)


in reply to Spreadsheet::ParseExcel with embedded PDF cells


The PDF files won't be embedded in the Excel document but rather in the OLE container/document that surrounds the Excel file.

As such Spreadsheet::ParseExcel isn't of any use in this case. If you want to extract the PDF files you will need to use OLE::Storage_Lite.

The first thing you will need to find out is the PPS (property set) name of the embedded objects. The smplls.pl utility that is part of the OLE::Storage_Lite distribution will show you the file structure and the PPS names. For example:

    perl smplls.pl Book1.xls
    00 1 'Root Entry' (pps 0) ROOT 00.01.1900 +00:00:00
    01 1 'Workbook' (pps 1) FILE 1000 +bytes
    02 2 ' SummaryInformation' (pps 2) FILE 1000 +bytes
    03 3 ' DocumentSummaryInformation' (pps 3) FILE 1000 +bytes
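If you would rather get a similar listing from your own script, here is a rough sketch of walking the PPS tree. It assumes the tree returned by getPpsTree() exposes the Name, Type and Child fields described in the OLE::Storage_Lite docs; the helper names printable_name and list_pps are made up for this example and are not part of the module. Control characters in names are shown as octal escapes so they are not mistaken for spaces.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# PPS type codes as described in the OLE::Storage_Lite::PPS docs.
my %TYPE_NAME = (1 => 'DIR', 2 => 'FILE', 5 => 'ROOT');

# Hypothetical helper: decode a UTF-16LE PPS name and make control
# characters visible, e.g. chr(5) is shown as \5.
sub printable_name {
    my ($raw) = @_;
    my $name = join '', map { $_ < 256 ? chr($_) : '?' } unpack('v*', $raw);
    $name =~ s/([\x00-\x1f])/sprintf '\\%o', ord $1/ge;
    return $name;
}

# Hypothetical helper: return one indented line per PPS entry.
sub list_pps {
    my ($pps, $depth) = @_;
    $depth ||= 0;
    my $type  = $TYPE_NAME{ $pps->{Type} } || '?';
    my @lines = (('  ' x $depth) . "$type '" . printable_name($pps->{Name}) . "'");
    for my $child (@{ $pps->{Child} || [] }) {
        push @lines, list_pps($child, $depth + 1);
    }
    return @lines;
}

# Only open a file when one is given on the command line.
if (@ARGV) {
    require OLE::Storage_Lite;
    my $ole  = OLE::Storage_Lite->new($ARGV[0]);
    my $root = $ole->getPpsTree()
        or die "Couldn't read OLE structure of $ARGV[0]\n";
    print "$_\n" for list_pps($root);
}
```

Run it as `perl list_pps.pl Book1.xls` (the script name is arbitrary). Storages holding embedded objects should show up as DIR entries below the root.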

Then you can extract the PPS structures using OLE::Storage_Lite. Here is a sample program that extracts the "Summary Information" from an Excel file to get you started.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;

    my $file        = 'Book1.xls';
    my $stream_name = "\5SummaryInformation";

    # Convert stream name to UTF16.
    $stream_name = pack 'v*', unpack 'C*', $stream_name;

    # Create the OLE reader object.
    my $ole = OLE::Storage_Lite->new($file);

    # Find the required stream in the OLE container.
    my $stream = ($ole->getPpsSearch([$stream_name], 1, 1))[0];

    die "Couldn't find required OLE data in $file. $!\n" unless $stream;

    # Do something with the data.
    my $data = $stream->{Data};

    # Remember to use binmode() on Windows.
    print $data;

Note, if the PPS name appears to start with a space it may actually be a low ordinal character such as "\0", "\1" or as in the case above "\5".
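Once you have a stream's {Data} as in the sample program above, the PDF may not start at the first byte, since an embedded-object stream can carry its own header before the payload. The following is only a heuristic sketch under that assumption, not something OLE::Storage_Lite provides: it carves out the bytes between the first "%PDF-" marker and the last "%%EOF" marker. The name carve_pdf is invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bytes from the first "%PDF-" header
# up to and including the last "%%EOF" marker, or nothing if the data
# does not look like it contains a PDF.
sub carve_pdf {
    my ($data) = @_;

    my $start = index $data, '%PDF-';
    return if $start < 0;

    my $end = rindex $data, '%%EOF';
    return if $end < $start;

    return substr $data, $start, $end + length('%%EOF') - $start;
}
```

To save the result, open the output file in raw mode, for example `open my $fh, '>:raw', 'out.pdf' or die $!; print {$fh} carve_pdf($data);`, so that nothing is translated on Windows. If the stream already holds a bare PDF, the function simply returns it from the header onwards.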

--
John.


Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 11, 2009 at 17:04 UTC
    Thanks, that looks very promising, if I can figure out the stream name for where the PDFs are.
Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 21, 2009 at 17:20 UTC
    Probably nobody reading this now, but... I seem to be unable to associate the PDF files that I (successfully) extracted to the cells they're coming from. Is there any way to do that?
      If you send me an example file using the email address in the OLE::Storage_Lite docs, I'll have a look at it and see if the cell addresses can be decoded using ParseExcel.

      --
      John.
