Beefy Boxes and Bandwidth Generously Provided by pair Networks
Don't ask to ask, just ask

Extracting text from pptx

by welle (Beadle)
on Mar 08, 2013 at 17:16 UTC ( #1022468=perlquestion: print w/replies, xml ) Need Help??

welle has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks

I need to extract text from a ppt file, just text without any formatting information. I know that pptx files are zip files and that the text is contained in xml files in ppt/slides/

Before I reinvent the wheel (I think the text data in the xml file may be more complex than I think now), maybe it is a task which as been already accomplished, so I'd like to ask if someone can point me to script snippets on the Web. I've already found this which could be a very good starting point, but it is not only perl.

Any suggestions?


I came up with this (partial) solution. Just Perl, also for Windows:

use Archive::Zip qw( :ERROR_CODES ); use Win32::OLE; my $mainfilepath="test.pptx"; my $zip = Archive::Zip->new(); # read the Word document, that is the ZIP file $zip->read( $mainfilepath ) == AZ_OK or die "Unable to open Office fil +e\n"; my $wfh = $zip->extractMember( 'ppt/slides/slide1.xml' ); # Routine for reading file into $content variable. # Source is Perl Monks: # { local $/=undef; open (FILE,"<:utf8","ppt/slides/slide1.xml") || (message_error01()) +; binmode FILE; $content = <FILE>; close FILE; } my $nl = "\n"; # Alternative is "\r\n". my $lindent = " "; # Indent nested lists by "\t", " " etc. my $lwidth = 80; # Line width, used for short line justification. # ToDo: Better list handling. Currently assumed 8 level nesting. my @levchar = ('*', '+', 'o', '-', '**', '++', 'oo', '--'); # # Text extraction starts. # $content =~ s/<?xml .*?\?>(\r)?\n//; $content =~ s{<w:p [^/>]+?/>|</w:p>}|$nl|og; $content =~ s|<w:br/>|$nl|og; $content =~ s|<w:tab/>|\t|og; my $hr = '-' x 78 . $nl; $content =~ s|<w:pBdr>.*?</w:pBdr>|$hr|og; $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . "$le +vchar[$1] "|oge; # # Uncomment either of below two lines and comment above line, if deali +ng # with more than 8 level nested lists. # # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|$lindent x $1 . '* + '|oge; # $content =~ s|<w:numPr><w:ilvl w:val="([0-9]+)"/>|'*' x ($1+1) . ' ' +|oge; $content =~ s{<w:caps/>.*?(<w:t>|<w:t [^>]+>)(.*?)</w:t>}/uc $2/oge; $content =~ s{<w:pPr><w:jc w:val="center"/></w:pPr><w:r><w:t>(.*?)</w: +t></w:r>}/cjustify($1)/oge; $content =~ s{<w:pPr><w:jc w:val="right"/></w:pPr><w:r><w:t>(.*?)</w:t +></w:r>}/rjustify($1)/oge; $content =~ s{<w:hyperlink r:id="(.*?)".*?>(.*?)</w:hyperlink>}/hyperl +ink($1,$2)/oge; #$content =~ s/<.*?>//g; $content =~ s/<.*?>/ /g; #Substitute tags with white spaces, in order +to always have white spaces between words # # Convert non-ASCII characters/character sequences to ASCII characters +. # # $content =~ s/\xE2\x82\xAC/\xC8/og; # euro symbol as saved by MSO +ffice $content =~ s/\xE2\x82\xAC/E/og; # euro symbol expressed as E $content =~ s/\xE2\x80\xA6/.../og; $content =~ s/\xE2\x80\xA2/::/og; # four dot diamond symbol $content =~ s/\xE2\x80\x9C/"/og; $content =~ s/\xE2\x80\x99/'/og; $content =~ s/\xE2\x80\x98/'/og; $content =~ s/\xE2\x80\x93/-/og; $content =~ s/\xC2\xA0//og; $content =~ s/&amp;/&/ogi; $content =~ s/&lt;/</ogi; $content =~ s/&gt;/>/ogi; $content =~ s/\s+/ /g;#delete extra white spaces my @row = split(/\n/, $content);

The script extracts the content of the first slide (I must change the initial part to reiterate for any slide) and find a way to insert a separator between two sections (fields) of a slide, as words are now without any white space between them.

Replies are listed 'Best First'.
Re: Extracting text from pptx
by Corion (Pope) on Mar 08, 2013 at 18:03 UTC

    What part of the linked program do you have problems with?

      Please, consider that I'm a novice with Perl

      I just can't understand the whole script: where is the unzipping part? If I run the script (on Windows) I simply get an error message (System not able to find the directory + Failed to extract required information from <file>). It must - I guess - with the script setting:

      my $unzip = "/usr/bin/unzip"; to do

        So, what have you done to find out where $unzip is then used?

        Also, the error message would suggest to me that somewhere, the program expects some other program, possibly unzip.exe to exist. What have you done to find out whether that is really the case?

        Likely an unzip utility can be found in the unxutils package.

        that looks like an Unix comand!
        (ok, that IS an unix path to a program, you'll want to find the call for something like 7zip or the likes and replace  "/usr/bin/unzip/ it with that).
        J -
Re: Extracting text from pptx
by vagabonding electron (Deacon) on Apr 18, 2013 at 15:15 UTC
    I am a beginner too, but here is my attempt. It is very quick and very dirty (since it takes header and footer from each slide as text as well as the text) but it does extract the text from the pptx (at least from my pptx :-) ) and perhaps it could be useful.
    #!/perl use strict; use warnings; use Archive::Zip qw( :ERROR_CODES ); use XML::Twig; my @text; my $file = "The full path to your pptx"; my $zip = Archive::Zip->new(); $zip->read( $file ) == AZ_OK or die "Unable to open Office file\n"; my @slides = $zip->membersMatching( "ppt/slides/slide.+\.xml" ); for my $i ( 1 .. scalar @slides ) # to sort them. { my $content = $zip->contents( "ppt/slides/slide${i}.xml"); my $twig= XML::Twig->new( #keep_encoding=>1, twig_handlers => { 'a:t' => \&topicref_processing, }, ); $twig->parse( $content ); } sub topicref_processing { my($twig, $ppttext) = @_; push @text, $ppttext->text(); } use FindBin qw($Bin); open my $out, ">:encoding(UTF-8)", "$Bin/test_ppt_PM.txt" or die "$!"; print {$out} "$_\n" for @text;

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://1022468]
Approved by Corion
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others perusing the Monastery: (4)
As of 2021-05-11 21:00 GMT
Find Nodes?
    Voting Booth?
    Perl 7 will be out ...

    Results (122 votes). Check out past polls.