http://www.perlmonks.org?node_id=676918


in reply to Re^2: Word Frequency in Particular Sentences
in thread Word Frequency in Particular Sentences

And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask.)

This goes through a PDF page by page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to do it in chunks (the 367-page PDF I tested it on took only a few seconds, though).

#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;

my $pdf_path = $ARGV[0] or die "No pdf specified";
my $pdf = CAM::PDF->new($pdf_path) or die "Could not open $pdf_path";

# Collect the text of every page
my $text = '';
for my $page (1..$pdf->numPages) {
    $text .= $pdf->getPageText($page);
}

# Save it all to a text file
open my $file, '>', 'pdftext.txt' or die "Cannot write pdftext.txt: $!";
print $file $text;
close $file;
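If the PDF is big enough that holding all the text in one string worries you, one way to do it "in chunks" is to write each page out as soon as you have it. This is only a sketch: the write_pages helper is a name I made up for illustration, and it assumes getPageText may hand back undef for a page with no extractable text. CAM::PDF is loaded lazily so the helper itself works with any object that has numPages and getPageText.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: write each page's text straight to the output
# file instead of building one big string first. $pdf can be any object
# that provides numPages() and getPageText($n), as CAM::PDF does.
sub write_pages {
    my ($pdf, $out_path) = @_;
    open my $out, '>', $out_path or die "Cannot open $out_path: $!";
    my $written = 0;
    for my $page (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;    # assumption: skip pages with no text
        print {$out} $text;
        $written++;
    }
    close $out or die "Cannot close $out_path: $!";
    return $written;
}

# Only run when a PDF path is given on the command line.
if (@ARGV) {
    require CAM::PDF;    # loaded lazily; must be installed from CPAN
    my $pdf = CAM::PDF->new($ARGV[0]) or die "Could not read $ARGV[0]";
    my $pages = write_pages($pdf, 'pdftext.txt');
    print "Wrote text from $pages page(s) to pdftext.txt\n";
}
```

Memory use then stays at roughly one page of text at a time. Note that CAM::PDF still has to load the PDF itself, so this helps with the size of the extracted text, not with the size of the document.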


I'm a peripheral visionary... I can see into the future, but just way off to the side.

Re^4: Word Frequency in Particular Sentences
by Anonymous Monk on Mar 28, 2008 at 16:40 UTC
    Thanks to all of you for your help. I appreciate it. Two quick comments: (i) Regarding the abbreviation problem, a quick manual scan indicates that all Asian sentences are (thankfully) bound by a period. (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found". Thanks again.
      (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found".

      Please don't overlook our very fine Tutorials: Installing Modules should be very helpful to you.

      HTH,

      planetscape

      What operating system are you using? General instructions are here, but I think they're a bit old (though the *nix ones, at least, look fine). I believe on Windows there is a Perl package manager that you would use.

      If you are having trouble, register an account here and send me a message, and I'll try to walk you through it.
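
      Before trying to install anything, it can help to check whether Perl can already find the module. A rough sketch (module_installed is just a name I picked, not something from CPAN); with no arguments it checks CAM::PDF, and it assumes the standard cpan client is what you would install with:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if Perl can load the named module from @INC.
sub module_installed {
    my ($module) = @_;
    # Accept only plain package names such as Foo or Foo::Bar.
    return 0 unless $module =~ /\A\w+(?:::\w+)*\z/;
    (my $file = "$module.pm") =~ s{::}{/}g;    # CAM::PDF -> CAM/PDF.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Check the modules named on the command line, or CAM::PDF by default.
for my $module (@ARGV ? @ARGV : ('CAM::PDF')) {
    if (module_installed($module)) {
        print "$module is installed\n";
    }
    else {
        print "$module is missing; try: cpan $module\n";
    }
}
```

      If it reports the module as missing even after an install that seemed to work, the install probably went to a different perl than the one running your script.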



        OK.