Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?

Spreadsheet::ParseExcel with embedded PDF cells

by ForgotPasswordAgain (Deacon)
on Jan 09, 2009 at 13:29 UTC ( #735171=perlquestion: print w/replies, xml ) Need Help??
ForgotPasswordAgain has asked for the wisdom of the Perl Monks concerning the following question:

It happens that users can embed PDFs and images into cells in Excel. As far as I can tell, there's no support for that in Spreadsheet::ParseExcel. True? I'm on a linux system, so no chance of using OLE. Anyone know another way to extract those media files from the cells? This is the relevant part of what I have (basically from the SYNOPSIS):

my $formatter = Spreadsheet::ParseExcel::FmtUnicode->new(); my $workbook = Spreadsheet::ParseExcel::Workbook->Parse($file, $format +ter); foreach my $worksheet ($workbook->worksheets) { my ($row_min, $row_max) = $worksheet->row_range; my ($col_min, $col_max) = $worksheet->col_range; foreach my $row ($row_min .. $row_max) { foreach my $col ($col_min .. $col_max) { my $cell = $worksheet->get_cell($row, $col); next unless $cell; # this is what I imagine doing: if ($cell->{Type} eq 'pdf') { # get the PDF somehow .... } } } }

Replies are listed 'Best First'.
Re: Spreadsheet::ParseExcel with embedded PDF cells
by Corion (Pope) on Jan 09, 2009 at 13:38 UTC

    As far as I understood OLE embedding and OLE Document Streams (and that never was really far), an embedded object is mostly a screenshot of the representation and a serialized opaque blob of the embedded object. The target machine hopefully has the same OLE-enabled application (that is, Acrobat Reader in your case), and that application is basically called upon to refresh the "preview" bitmap or edit the embedded object if necessary.

    At least from this rudimentary understanding, I doubt that there will be a generic way to get at the "embedded file", as the data blob is mostly opaque. Possibly you can write a special case hack for getting at the PDF, as the PDF likely has a distinctive start signature (%%) and contains mostly text and gzipped blobs.

Re: Spreadsheet::ParseExcel with embedded PDF cells
by jmcnamara (Monsignor) on Jan 09, 2009 at 16:27 UTC

    The PDF files won't be embedded in the Excel document but rather in the OLE container/document that surrounds the Excel file.

    As such Spreadsheet::ParseExcel isn't of any use in this case. If you want to extract the PDF files you will need to use OLE::Storage_Lite.

    The first thing you will need to find out is the PPS (property set) name of the embedded objects. The utility that is part of the OLE::Storage_Lite will show you the File structure and the PPS names. For example:

    perl Book1.xls 00 1 'Root Entry' (pps 0) ROOT 00.01.1900 +00:00:00 01 1 'Workbook' (pps 1) FILE 1000 +bytes 02 2 ' SummaryInformation' (pps 2) FILE 1000 +bytes 03 3 ' DocumentSummaryInformation' (pps 3) FILE 1000 +bytes

    Then you can extract the PPS structures using OLE::Storage_Lite. Here is a sample program that extracts the "Summary Information" from an Excel file to get you started.

    #!/usr/bin/perl use strict; use warnings; use OLE::Storage_Lite; my $file = 'Book1.xls'; my $stream_name = "\5SummaryInformation"; # Convert stream name to UTF16. $stream_name = pack 'v*', unpack 'C*', $stream_name; # Create the OLE reader object. my $ole = OLE::Storage_Lite->new($file); # Find the required stream in the OLE container. my $stream = ($ole->getPpsSearch([$stream_name], 1, 1))[0]; die "Couldn't find required OLE data in $file. $!\n" unless $strea +m; # Do something with the data. my $data = $stream->{Data}; # Remember to use binmode() on Windows. print $data;
    Note, if the PPS name appears to start with a space it may actually be a low ordinal character such as "\0", "\1" or as in the case above "\5".


      Probably nobody reading this now, but... I seem to be unable to associate the PDF files that I (successfully) extracted to the cells they're coming from. Is there any way to do that?
        If you send me an example file using the email address in the OLE::Storage_Lite docs I'll have a look at it and see if the cell addresses can be decoded out using ParseExcel.


      Thanks, that looks very promising, if I can figure out the stream name for where the PDFs are.

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://735171]
Approved by moritz
and a log crumbles through the grate...

How do I use this? | Other CB clients
Other Users?
Others pondering the Monastery: (8)
As of 2018-03-23 01:41 GMT
Find Nodes?
    Voting Booth?
    When I think of a mole I think of:

    Results (287 votes). Check out past polls.