http://www.perlmonks.org?node_id=1022468

welle has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I need to extract text from a ppt file, just text without any formatting information. I know that pptx files are zip files and that the text is contained in xml files in ppt/slides/

Before I reinvent the wheel (I think the text data in the xml file may be more complex than I think now), maybe it is a task which as been already accomplished, so I'd like to ask if someone can point me to script snippets on the Web. I've already found this which could be a very good starting point, but it is not only perl.

Any suggestions?

UPDATE

I came up with this (partial) solution. Just Perl, also for Windows:

use Archive::Zip qw( :ERROR_CODES ); use Win32::OLE; my $mainfilepath="test.pptx"; my $zip = Archive::Zip->new(); # read the Word document, that is the ZIP file $zip->read( $mainfilepath ) == AZ_OK or die "Unable to open Office fil +e\n"; my $wfh = $zip->extractMember( 'ppt/slides/slide1.xml' ); # Routine for reading file into $content variable. # Source is Perl Monks: http://www.perlmonks.org/?node_id=1952 # { local $/=undef; open (FILE,"<:utf8","ppt/slides/slide1.xml") || (message_error01()) +; binmode FILE; $content = <FILE>; close FILE; } my $nl = "\n"; # Alternative is "\r\n". my $lindent = " "; # Indent nested lists by "\t", " " etc. my $lwidth = 80; # Line width, used for short line justification. # ToDo: Better list handling. Currently assumed 8 level nesting. my @levchar = ('*', '+', 'o', '-', '**', '++', 'oo', '--'); # # Text extraction starts. # $content =~ s/<?xml .*?\?>(\r)?\n//; $content =~ s{<w:p [^/>]+?/>|</w:p>}|$nl|og; $content =~ s|<w:br/>|$nl|og; $content =~ s|<w:tab/>|\t|og; my $hr = '-' x 78 . $nl; $content =~ s|<w:pBdr>.*?</w:pBdr>|$hr|og; $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . "$le +vchar[$1] "|oge; # # Uncomment either of below two lines and comment above line, if deali +ng # with more than 8 level nested lists. # # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . '* + '|oge; # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|'*' x ($1+1) . ' ' +|oge; $content =~ s{<w:caps/>.*?(<w:t>|<w:t [^>]+>)(.*?)</w:t>}/uc $2/oge; $content =~ s{<w:pPr><w:jc w:val="center"/></w:pPr><w:r><w:t>(.*?)</w: +t></w:r>}/cjustify($1)/oge; $content =~ s{<w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:t>(.*?)</w:t +></w:r>}/rjustify($1)/oge; $content =~ s{<w:hyperlink r:id="(.*?)".*?>(.*?)</w:hyperlink>}/hyperl +ink($1,$2)/oge; #$content =~ s/<.*?>//g; $content =~ s/<.*?>/ /g; #Substitute tags with white spaces, in order +to always have white spaces between words # # Convert non-ASCII characters/character sequences to ASCII characters +. # # $content =~ s/\xE2\x82\xAC/\xC8/og; # euro symbol as saved by MSO +ffice $content =~ s/\xE2\x82\xAC/E/og; # euro symbol expressed as E $content =~ s/\xE2\x80\xA6/.../og; $content =~ s/\xE2\x80\xA2/::/og; # four dot diamond symbol $content =~ s/\xE2\x80\x9C/"/og; $content =~ s/\xE2\x80\x99/'/og; $content =~ s/\xE2\x80\x98/'/og; $content =~ s/\xE2\x80\x93/-/og; $content =~ s/\xC2\xA0//og; $content =~ s/&amp;/&/ogi; $content =~ s/&lt;/</ogi; $content =~ s/&gt;/>/ogi; $content =~ s/\s+/ /g;#delete extra white spaces my @row = split(/\n/, $content);

The script extracts the content of the first slide (I must change the initial part to reiterate for any slide) and find a way to insert a separator between two sections (fields) of a slide, as words are now without any white space between them.