
Re: Spreadsheet::ParseExcel with embedded PDF cells

by jmcnamara (Monsignor)
on Jan 09, 2009 at 16:27 UTC

in reply to Spreadsheet::ParseExcel with embedded PDF cells

The PDF files won't be embedded in the Excel document but rather in the OLE container/document that surrounds the Excel file.

As such, Spreadsheet::ParseExcel isn't of any use in this case. If you want to extract the PDF files, you will need to use OLE::Storage_Lite.

The first thing you will need to find out is the PPS (property set) name of each embedded object. A utility that is part of the OLE::Storage_Lite distribution will show you the file structure and the PPS names. For example:

    perl Book1.xls

    00 1 'Root Entry' (pps 0) ROOT 00.01.1900 00:00:00
    01   1 'Workbook' (pps 1) FILE 1000 bytes
    02   2 ' SummaryInformation' (pps 2) FILE 1000 bytes
    03   3 ' DocumentSummaryInformation' (pps 3) FILE 1000 bytes
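
If you would rather list the PPS names from your own script, something like the following might work. It is only a sketch: printable_name() and list_pps() are names I made up for illustration, and it assumes each PPS object returned by getPpsTree() carries the Name, Type, Size and Child fields. Control characters are printed as escapes such as \5 so that they don't show up as blanks.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn a UTF-16LE PPS name into printable text,
# showing control characters as escapes such as \5 instead of a blank.
sub printable_name {
    my ($utf16_name) = @_;
    my $name = join '', map { chr } unpack 'v*', $utf16_name;
    $name =~ s/([^\x20-\x7e])/
        ord($1) < 32 ? sprintf('\\%o', ord $1) : sprintf('\\x{%X}', ord $1)
    /gex;
    return $name;
}

# Hypothetical helper: walk a PPS tree and return one line per entry.
# Assumes each node has Name, Type, Size and Child fields.
sub list_pps {
    my ($pps, $depth) = @_;
    $depth = 0 unless defined $depth;

    my %type_name = (1 => 'DIR', 2 => 'FILE', 5 => 'ROOT');
    my $type = $type_name{ $pps->{Type} } || 'UNKNOWN';
    my $size = defined $pps->{Size} ? $pps->{Size} : 0;

    my @lines = sprintf "%s'%s' %s %d bytes",
        '  ' x $depth, printable_name($pps->{Name}), $type, $size;

    # Recurse into any children of this entry.
    push @lines, list_pps($_, $depth + 1) for @{ $pps->{Child} || [] };
    return @lines;
}

# Only touch a real file when one is given on the command line.
if (@ARGV) {
    require OLE::Storage_Lite;
    my $ole  = OLE::Storage_Lite->new($ARGV[0]);
    my $root = $ole->getPpsTree()
        or die "Couldn't read an OLE structure from $ARGV[0]\n";
    print "$_\n" for list_pps($root);
}
```

Run it with the .xls file as its only argument. The names it prints can be pasted into a double-quoted Perl string as they are, escapes included.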

Then you can extract the PPS structures using OLE::Storage_Lite. Here is a sample program that extracts the "Summary Information" from an Excel file to get you started.

    #!/usr/bin/perl

    use strict;
    use warnings;
    use OLE::Storage_Lite;

    my $file        = 'Book1.xls';
    my $stream_name = "\5SummaryInformation";

    # Convert stream name to UTF16.
    $stream_name = pack 'v*', unpack 'C*', $stream_name;

    # Create the OLE reader object.
    my $ole = OLE::Storage_Lite->new($file);

    # Find the required stream in the OLE container.
    my $stream = ($ole->getPpsSearch([$stream_name], 1, 1))[0];

    die "Couldn't find required OLE data in $file. $!\n" unless $stream;

    # Do something with the data.
    my $data = $stream->{Data};

    # Remember to use binmode() on Windows.
    print $data;

Note: if a PPS name appears to start with a space, the first character may actually be a low-ordinal control character such as "\0", "\1" or, as in the case above, "\5".
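
Once you know which stream holds an embedded object, the stream data may contain extra bytes around the PDF itself. The sketch below is one way to cut the PDF out, assuming the PDF is stored uncompressed inside the stream and so can be found by its "%PDF-" header and "%%EOF" trailer. carve_pdf() is a made-up helper name, and the stream name is whatever the listing shows for your file; I'm not assuming any particular name here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the bytes from the first "%PDF-" header to
# the last "%%EOF" marker in $data, or nothing if no complete PDF is found.
sub carve_pdf {
    my ($data) = @_;

    my $start = index $data, '%PDF-';
    return if $start < 0;

    my $eof = rindex $data, '%%EOF';
    return if $eof < $start;

    return substr $data, $start, $eof + length('%%EOF') - $start;
}

# Usage: script.pl file.xls STREAM_NAME > out.pdf
# STREAM_NAME is given as plain ASCII and converted to UTF16 below.
if (@ARGV == 2) {
    require OLE::Storage_Lite;
    my ($file, $name) = @ARGV;

    my $utf16_name = pack 'v*', unpack 'C*', $name;

    my $ole    = OLE::Storage_Lite->new($file);
    my $stream = ($ole->getPpsSearch([$utf16_name], 1, 1))[0]
        or die "Couldn't find stream '$name' in $file\n";

    my $pdf = carve_pdf($stream->{Data});
    defined $pdf or die "No PDF data found in stream '$name'\n";

    # The PDF is binary data, so avoid newline translation on Windows.
    binmode STDOUT;
    print $pdf;
}
```

A stream name that starts with a control character is awkward to pass on the command line, so in that case it is probably easier to hard-code it in a double-quoted string, as in the sample program above.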


Replies are listed 'Best First'.
Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 21, 2009 at 17:20 UTC
    Probably nobody reading this now, but... I seem to be unable to associate the PDF files that I (successfully) extracted to the cells they're coming from. Is there any way to do that?
      If you send me an example file using the email address in the OLE::Storage_Lite docs I'll have a look at it and see if the cell addresses can be decoded out using ParseExcel.


Re^2: Spreadsheet::ParseExcel with embedded PDF cells
by ForgotPasswordAgain (Deacon) on Jan 11, 2009 at 17:04 UTC
    Thanks, that looks very promising, if I can figure out the stream name for where the PDFs are.
