PerlMonks  

Re^3: Word Frequency in Particular Sentences

by nefigah (Monk)
on Mar 28, 2008 at 06:25 UTC (#676918)


in reply to Re^2: Word Frequency in Particular Sentences
in thread Word Frequency in Particular Sentences

And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask.)

This goes through a PDF page by page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to write the text out in chunks rather than collecting it all in one string (the 367-page PDF I tested it on only took a few seconds, though).

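#!/usr/bin/perl
use warnings;
use strict;
use CAM::PDF;

my $pdf_path = $ARGV[0] or die "No pdf specified";

# new() returns undef if the file cannot be read or parsed
my $pdf = CAM::PDF->new($pdf_path)
    or die "Could not open '$pdf_path': $CAM::PDF::errstr\n";

# Collect the text of every page (pages are numbered from 1)
my $text = '';
for my $page (1 .. $pdf->numPages) {
    $text .= $pdf->getPageText($page);
}

# Save everything to a text file in the current directory
open my $file, '>', 'pdftext.txt' or die "Cannot write pdftext.txt: $!";
print $file $text;
close $file;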
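
For a really big PDF, the "in chunks" idea could be sketched roughly like this: write each page's text to the file as soon as it is extracted instead of building one large string. This is only a sketch. The helper name dump_pages is made up for illustration, and skipping pages that return no text is an assumption about what you would want. It only avoids accumulating the extracted text; how much memory the module itself needs is a separate matter.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# dump_pages (hypothetical helper): print the text of each page to the
# filehandle $out as soon as it is extracted, and return how many pages
# produced text. $pdf is anything that provides numPages() and
# getPageText($n), i.e. the two CAM::PDF methods used in the script above.
sub dump_pages {
    my ($pdf, $out) = @_;
    my $count = 0;
    for my $page (1 .. $pdf->numPages) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;    # assumption: skip pages with no text
        print {$out} $text;
        $count++;
    }
    return $count;
}

# Only do real work when a PDF path is given on the command line.
if (@ARGV) {
    require CAM::PDF;
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Could not open '$ARGV[0]' as a PDF\n";
    open my $out, '>', 'pdftext.txt' or die "Cannot write pdftext.txt: $!";
    my $n = dump_pages($pdf, $out);
    close $out or die "Cannot close pdftext.txt: $!";
    print "Wrote text from $n page(s)\n";
}
```

Keeping the loop in its own sub means you can point it at any filehandle (a file, STDOUT, or an in-memory string) without touching the extraction code.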


I'm a peripheral visionary... I can see into the future, but just way off to the side.


Re^4: Word Frequency in Particular Sentences
by Anonymous Monk on Mar 28, 2008 at 16:40 UTC
    Thanks to all of you for your help. I appreciate it. Two quick comments: (i) Regarding the abbreviation problem, a quick manual scan indicates that all Asian sentences are (thankfully) terminated by a period. (ii) How do you, in fact, install CPAN modules? I tried once, but got an error message along the lines of "file not found". Thanks again.

      What operating system are you using? General instructions are here, but I think they're a bit old (the *nix ones look fine though at least). I believe on Windows there is a Perl package manager that you would use.

      If you are having trouble, you should register an account here and send me a message, and I'll try to walk you through it.


      I'm a peripheral visionary... I can see into the future, but just way off to the side.

        OK.
      (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found".

      Please don't overlook our very fine Tutorials: Installing Modules should be very helpful to you.

      HTH,

      planetscape
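
On the module-installation question above: before (or after) an install attempt, it can help to check whether perl can actually find the module. A minimal sketch, assuming nothing beyond core Perl. The helper name module_installed is made up for illustration, and the "cpan" hint it prints refers to the command-line CPAN client that ships with perl, which may or may not be set up on your machine.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# module_installed (hypothetical helper): return 1 if this perl can load
# the named module, 0 otherwise.
sub module_installed {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;    # CAM::PDF -> CAM/PDF.pm
    return eval { require $file; 1 } ? 1 : 0;
}

# Check every module name given on the command line.
for my $module (@ARGV) {
    if (module_installed($module)) {
        print "$module is installed\n";
    }
    else {
        print "$module is missing; try: cpan $module\n";
    }
}
```

Saved as, say, checkmod.pl, it would be run as: perl checkmod.pl CAM::PDF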
