
pdf and ppt to text

by sarvan (Sexton)
on Aug 03, 2011 at 10:30 UTC (#918221=perlquestion)

sarvan has asked for the wisdom of the Perl Monks concerning the following question:

Hi there,

I am working with PDF and PPT files, and I need to extract their text in order to find some relevancy.

I need a way to extract the text from both formats. I tried the CPAN module Text::PDF and other modules too, but could not get the expected result. All I want in the end is the text from the PDF and PPT files.

Can anyone suggest an approach, and also tell me if I am wrong on any point?


Replies are listed 'Best First'.
Re: pdf and ppt to text
by moritz (Cardinal) on Aug 03, 2011 at 11:18 UTC
Re: pdf and ppt to text
by zentara (Archbishop) on Aug 03, 2011 at 11:26 UTC
Re: pdf and ppt to text
by LanX (Saint) on Aug 03, 2011 at 11:49 UTC
    I would try to make ppt produce pdf and then process the pdfs.

    You haven't specified what your "expected results" are, so I presume you need not only the text but also positional information.

    So please see "Parsing PDFs by text position?" and the older threads referenced there for various approaches.

    Cheers Rolf
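
The "ppt to pdf, then process the pdf" idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes LibreOffice's soffice and poppler's pdftotext are installed and on the PATH (neither tool is named in the thread), and the helper names convert_cmd, extract_cmd, pdf_path_for and ppt_to_text are made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(fileparse);
use File::Spec;

# Command that converts a presentation to PDF.
# Assumption: LibreOffice's "soffice" binary is on the PATH.
sub convert_cmd {
    my ($ppt, $outdir) = @_;
    return ('soffice', '--headless', '--convert-to', 'pdf',
            '--outdir', $outdir, $ppt);
}

# Command that dumps a PDF as plain text.
# Assumption: poppler's "pdftotext" is on the PATH; -layout tries to
# preserve the physical layout, which helps when positions matter.
sub extract_cmd {
    my ($pdf, $txt) = @_;
    return ('pdftotext', '-layout', $pdf, $txt);
}

# Path of the PDF that the conversion step writes into $outdir.
sub pdf_path_for {
    my ($ppt, $outdir) = @_;
    my ($base) = fileparse($ppt, qr/\.[^.]*/);   # strip the last extension
    return File::Spec->catfile($outdir, "$base.pdf");
}

# Run both steps and return the name of the resulting text file.
sub ppt_to_text {
    my ($ppt, $outdir) = @_;
    system(convert_cmd($ppt, $outdir)) == 0
        or die "ppt to pdf conversion failed: $?";
    my $pdf = pdf_path_for($ppt, $outdir);
    (my $txt = $pdf) =~ s/\.pdf\z/.txt/;
    system(extract_cmd($pdf, $txt)) == 0
        or die "pdftotext failed: $?";
    return $txt;
}
```

Calling ppt_to_text('test.ppt', 'out') would then leave out/test.txt next to the intermediate out/test.pdf, provided both external tools are available.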

Re: pdf and ppt to text
by Khen1950fx (Canon) on Aug 04, 2011 at 09:12 UTC
    To get text from a pdf, I use Text::FromAny.
    To get text from a ppt, I use catppt from catdoc.

    Prerequisites =>

    wish from Tcl
    catppt from catdoc

    Module Prerequisites =>

    #!/usr/bin/perl
    use strict;
    use warnings;
    use CPAN;

    CPAN::Shell->install(qw(
        XML::Twig
        Archive::Zip
        File::Temp
        Time::Local
        IO::File
        Any::Moose
        Try::Tiny
        Text::Extract::Word
        OpenOffice::OODoc
        File::LibMagic
        RTF::Parser
        HTML::FormatText::WithLinks
        CAM::PDF
        Text::FromAny
    ));

    Once the prereqs are satisfied, run this:
    #!/usr/bin/perl
    use strict;
    use warnings;
    use File::Fetch;
    use Text::FromAny;

    # Download the two sample files into the current directory.
    my $ff1 = File::Fetch->new( uri => ' +.15/t/test.ppt');
    my $ff2 = File::Fetch->new( uri => ' +.15/t/test.pdf');
    my $where1 = $ff1->fetch( ) or die $ff1->error;
    my $where2 = $ff2->fetch( ) or die $ff2->error;

    # Extract the text of the PDF with Text::FromAny.
    my $tFromAny = Text::FromAny->new( file => 'test.pdf');
    my $text = $tFromAny->text;
    print $text, "\n";

    # Run catppt: first with -lV, then on the PPT itself to dump its text.
    system("/usr/local/bin/catppt -lV");
    print "\n";
    system("/usr/local/bin/catppt test.ppt");

      Hi Khen1950fx,

      Thanks for the help. When I run the dependency installation code, the installation of File::LibMagic fails. I then tried to install it separately, but when I run perl Makefile.PL it shows an error like "cant include magic.h".

      What is the problem here?
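
An error about magic.h usually means the C header of the libmagic library, which File::LibMagic builds against, could not be found. A minimal sketch to check for it follows. The directory list is an assumption about typical Unix layouts, and find_header is a hypothetical helper written for this example. On Debian-style systems the header normally comes from the libmagic-dev package, and on Fedora-style systems from file-devel.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Return the directories from @dirs that contain a file called $name.
sub find_header {
    my ($name, @dirs) = @_;
    return grep { -f File::Spec->catfile($_, $name) } @dirs;
}

# Assumption: these are the usual include directories; adjust as needed.
my @include_dirs = ('/usr/include', '/usr/local/include', '/opt/local/include');

my @found = find_header('magic.h', @include_dirs);
if (@found) {
    print "magic.h found in: @found\n";
}
else {
    print "magic.h not found; the libmagic development headers are probably missing\n";
}
```

If the header lives in a directory that is not searched by default, the build has to be told where to look; the File::LibMagic installation notes describe how.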
