Re: Spreadsheet::ParseExcel with embedded PDF cells

by jmcnamara (Monsignor)
on Jan 09, 2009 at 16:27 UTC (#735234)


in reply to Spreadsheet::ParseExcel with embedded PDF cells


The PDF files won't be embedded in the Excel document but rather in the OLE container/document that surrounds the Excel file.

As such, Spreadsheet::ParseExcel isn't of any use in this case. If you want to extract the PDF files, you will need to use OLE::Storage_Lite.

The first thing you will need to find out is the PPS (property set) name of the embedded objects. The smplls.pl utility that is part of the OLE::Storage_Lite distribution will show you the file structure and the PPS names. For example:

    perl smplls.pl Book1.xls

    00 1 'Root Entry' (pps 0) ROOT 00.01.1900 00:00:00
    01 1 'Workbook' (pps 1) FILE 1000 bytes
    02 2 ' SummaryInformation' (pps 2) FILE 1000 bytes
    03 3 ' DocumentSummaryInformation' (pps 3) FILE 1000 bytes
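If smplls.pl isn't to hand, a similar listing can be sketched directly with getPpsTree(). This is a minimal sketch rather than a copy of smplls.pl: pps_lines() is a hypothetical helper name, and it assumes that each PPS object exposes its UTF-16LE name in {Name} and its children in {Child}. Control characters in a name are printed as octal escapes so that a leading "\5" is visible instead of looking like a space.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# pps_lines() is a hypothetical helper, not part of OLE::Storage_Lite.
# It returns one indented line per PPS node, depth first.
sub pps_lines {
    my ($pps, $depth) = @_;
    $depth //= 0;

    # PPS names are stored as UTF-16LE; turn them into Perl characters.
    my $name = join '', map { chr } unpack 'v*', $pps->{Name};

    # Show control characters as octal escapes, e.g. \5.
    $name =~ s/([\x00-\x1f])/sprintf '\\%o', ord $1/ge;

    my @lines = (('  ' x $depth) . "'$name'");
    push @lines, map { pps_lines($_, $depth + 1) } @{ $pps->{Child} || [] };

    return @lines;
}

# Only touch the module when a file name is given on the command line.
if (@ARGV) {
    require OLE::Storage_Lite;

    my $file = $ARGV[0];
    my $ole  = OLE::Storage_Lite->new($file);
    my $root = $ole->getPpsTree()
        or die "Couldn't read OLE data in $file\n";

    print "$_\n" for pps_lines($root);
}
```

Run it as "perl list_pps.pl Book1.xls" (the script name is arbitrary). Any storage or stream that Excel created for an embedded object should show up as an extra entry alongside 'Workbook'.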

Then you can extract the PPS structures using OLE::Storage_Lite. Here is a sample program that extracts the "Summary Information" from an Excel file to get you started.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;

    my $file        = 'Book1.xls';
    my $stream_name = "\5SummaryInformation";

    # Convert stream name to UTF16.
    $stream_name = pack 'v*', unpack 'C*', $stream_name;

    # Create the OLE reader object.
    my $ole = OLE::Storage_Lite->new($file);

    # Find the required stream in the OLE container.
    my $stream = ($ole->getPpsSearch([$stream_name], 1, 1))[0];

    die "Couldn't find required OLE data in $file. $!\n" unless $stream;

    # Do something with the data.
    my $data = $stream->{Data};

    # Remember to use binmode() on Windows.
    print $data;

Note: if the PPS name appears to start with a space, the first character may actually be a low ordinal character such as "\0", "\1" or, as in the case above, "\5".
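When the stream name that holds a PDF isn't known yet, one possible workaround is to load every stream and look for the PDF signature in its data. The following is only a sketch under assumptions: it supposes that an embedded PDF is stored uncompressed inside a single stream, possibly with some wrapper bytes before and after it, and extract_pdf(), all_pps() and the embedded_NN.pdf output names are hypothetical, not part of OLE::Storage_Lite.

```perl
#!/usr/bin/perl

use strict;
use warnings;

# Return the bytes from the first "%PDF-" to the last "%%EOF" in $data,
# or nothing if the signature isn't present. Hypothetical helper.
sub extract_pdf {
    my ($data) = @_;
    return unless defined $data;

    my $start = index $data, '%PDF-';
    return if $start < 0;

    my $end = rindex $data, '%%EOF';

    # No end marker after the signature: keep everything from the signature on.
    return substr $data, $start if $end < $start;

    return substr $data, $start, $end + length('%%EOF') - $start;
}

# Depth-first list of a PPS node and everything below it. Hypothetical helper.
sub all_pps {
    my ($pps) = @_;
    return ($pps, map { all_pps($_) } @{ $pps->{Child} || [] });
}

# Only touch the module when a file name is given on the command line.
if (@ARGV) {
    require OLE::Storage_Lite;

    my $file = $ARGV[0];
    my $ole  = OLE::Storage_Lite->new($file);

    # A true argument asks getPpsTree() to load each stream's data as well.
    my $root = $ole->getPpsTree(1)
        or die "Couldn't read OLE data in $file\n";

    my $count = 0;
    for my $pps (all_pps($root)) {
        my $pdf = extract_pdf($pps->{Data}) or next;

        my $out = sprintf 'embedded_%02d.pdf', ++$count;
        open my $fh, '>', $out or die "Can't write $out: $!\n";
        binmode $fh;    # Needed on Windows.
        print {$fh} $pdf;
        close $fh;

        print "Wrote $out (pps $pps->{No})\n";
    }
}
```

The (pps N) number printed for each hit can be matched against a listing of the file structure to find the name of the stream that held the PDF.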

--
John.


Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 11, 2009 at 17:04 UTC
    Thanks, that looks very promising, if I can figure out the stream name for where the PDFs are.
Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 21, 2009 at 17:20 UTC
    Probably nobody reading this now, but... I seem to be unable to associate the PDF files that I (successfully) extracted to the cells they're coming from. Is there any way to do that?
      If you send me an example file using the email address in the OLE::Storage_Lite docs I'll have a look at it and see if the cell addresses can be decoded out using ParseExcel.

      --
      John.
