Extracting text from pptx

welle has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I need to extract text from a ppt file, just text without any formatting information. I know that pptx files are zip files and that the text is contained in xml files in ppt/slides/

Before I reinvent the wheel (I think the text data in the xml file may be more complex than I think now), maybe it is a task which as been already accomplished, so I'd like to ask if someone can point me to script snippets on the Web. I've already found this which could be a very good starting point, but it is not only perl.

Any suggestions?

UPDATE

I came up with this (partial) solution. Just Perl, also for Windows:

use Archive::Zip qw( :ERROR_CODES );
use Win32::OLE;

my $mainfilepath="test.pptx";
my $zip = Archive::Zip->new();

# read the Word document, that is the ZIP file
$zip->read( $mainfilepath ) == AZ_OK or die "Unable to open Office fil
+e\n";
my $wfh = $zip->extractMember( 'ppt/slides/slide1.xml' );
 

# Routine for reading file into $content variable.  
# Source is Perl Monks: http://www.perlmonks.org/?node_id=1952
#
{
  local $/=undef;
  open (FILE,"<:utf8","ppt/slides/slide1.xml") || (message_error01()) 
+;
  binmode FILE;
  $content = <FILE>;
  close FILE;
}

my $nl = "\n";        # Alternative is "\r\n".
my $lindent = "  ";    # Indent nested lists by "\t", " " etc.
my $lwidth = 80;    # Line width, used for short line justification.

# ToDo: Better list handling. Currently assumed 8 level nesting.
my @levchar = ('*', '+', 'o', '-', '**', '++', 'oo', '--');
#
# Text extraction starts.
#

$content =~ s/<?xml .*?\?>(\r)?\n//;

$content =~ s{<w:p [^/>]+?/>|</w:p>}|$nl|og;
$content =~ s|<w:br/>|$nl|og;
$content =~ s|<w:tab/>|\t|og;

my $hr = '-' x 78 . $nl;
$content =~ s|<w:pBdr>.*?</w:pBdr>|$hr|og;

$content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . "$le
+vchar[$1] "|oge;

#
# Uncomment either of below two lines and comment above line, if deali
+ng
# with more than 8 level nested lists.
#

# $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . '*
+ '|oge;
# $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|'*' x ($1+1) . ' '
+|oge;

$content =~ s{<w:caps/>.*?(<w:t>|<w:t [^>]+>)(.*?)</w:t>}/uc $2/oge;

$content =~ s{<w:pPr><w:jc w:val="center"/></w:pPr><w:r><w:t>(.*?)</w:
+t></w:r>}/cjustify($1)/oge;

$content =~ s{<w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:t>(.*?)</w:t
+></w:r>}/rjustify($1)/oge;

$content =~ s{<w:hyperlink r:id="(.*?)".*?>(.*?)</w:hyperlink>}/hyperl
+ink($1,$2)/oge;

#$content =~ s/<.*?>//g;
$content =~ s/<.*?>/ /g; #Substitute tags with white spaces, in order 
+to always have white spaces between words 


#
# Convert non-ASCII characters/character sequences to ASCII characters
+.
#

# $content =~ s/\xE2\x82\xAC/\xC8/og;    # euro symbol as saved by MSO
+ffice
$content =~ s/\xE2\x82\xAC/E/og;    # euro symbol expressed as E

$content =~ s/\xE2\x80\xA6/.../og;
$content =~ s/\xE2\x80\xA2/::/og;    # four dot diamond symbol
$content =~ s/\xE2\x80\x9C/"/og;
$content =~ s/\xE2\x80\x99/'/og;
$content =~ s/\xE2\x80\x98/'/og;
$content =~ s/\xE2\x80\x93/-/og;

$content =~ s/\xC2\xA0//og;

$content =~ s/&amp;/&/ogi;
$content =~ s/&lt;/</ogi;
$content =~ s/&gt;/>/ogi;

$content =~ s/\s+/ /g;#delete extra white spaces

my @row = split(/\n/, $content);
[download]

The script extracts the content of the first slide (I must change the initial part to reiterate for any slide) ~~and find a way to insert a separator between two sections (fields) of a slide, as words are now without any white space between them.~~

Back to Seekers of Perl Wisdom