http://www.perlmonks.org?node_id=676869


in reply to Word Frequency in Particular Sentences

Everybody stand back! :)

As was well stated, someone has already written things to get text out of PDFs for you. The second problem of finding sentences with "Asia" in them is more interesting, and should be a good learning exercise for you.

So, pretending that you already have a plain text file full of, erm, text available, how would you go about identifying the sentences in it that mention Asia? Do you have an idea of how you would begin? (I'm trying to ascertain what you already know and what you have already written.)


I'm a peripheral visionary... I can see into the future, but just way off to the side.

Re^2: Word Frequency in Particular Sentences
by Anonymous Monk on Mar 28, 2008 at 02:39 UTC
    Let me 'fess up. I was hoping that a similar problem has been solved already by someone and I could simply adapt that. As a well aged academic economist, I am way past trying to master Perl at any deep level. Still I will be grateful if you (or someone) could assure me that this is a doable problem in Perl and maybe point to a few functions/regular expressions(?) that may be used in this case. Thanks.
      OK ... here's a small code example to get you started. (You'll still want to hit CPAN for a PDF parsing module, though.)

      #!/usr/bin/perl -w
      use strict;
      use warnings;

      # Tell perl to split records on periods.
      $/ = '.';

      my %words;

      # Read successive records from our __DATA__ section
      while (<DATA>) {
          # Skip the sentence unless it contains the text "asia"
          next unless m/asia/i;

          # Remove extraneous characters
          tr/a-zA-Z/ /cs;

          # Show each sentence we keep
          print "<$_>\n";

          # Increment the counter for each word found
          map { $words{$_}++ } split;
      }

      print "\n\n"
          . "Count Word\n"
          . "----- -------------\n";

      # Print all words in the sentences that appear more than once.
      for (sort keys %words) {
          next unless $words{$_} > 1;
          print "$words{$_}\t$_\n";
      }

      __DATA__
      Now is the time for all good. Men to come to the Asia of their party.
      The quick red fox jumped over the calico cat.
      One fish two fish asiatic fish blue fish. Zoom.
      When must we come to asia to see the fox? Dolum ipsum dolor est.
      Canem homo mordet. I would guess that few people speak latin in Asia.
      Perhaps many more asians speak greek. But how would I know?
      When run on my machine, it gives us:

      roboticus~ $ ./re_test.pl
      < Men to come to the Asia of their party >
      < One fish two fish asiatic fish blue fish >
      < When must we come to asia to see the fox Dolum ipsum dolor est >
      < I would guess that few people speak latin in Asia >
      < Perhaps many more asians speak greek >


      Count Word
      ----- -------------
      2       Asia
      2       come
      4       fish
      2       speak
      2       the
      4       to
      roboticus~ $
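
      Note that the output above merges "...see the fox" with "Dolum ipsum dolor est" because only periods end a record, and that "Asia" and "asia" are counted separately. A minimal sketch of one way around both, assuming sentences end at '.', '?' or '!' and that case should be ignored when counting. The helper name count_words_in_matching_sentences and the sample text are made up for illustration, and abbreviations such as "U.S." would still be split in the wrong place:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Count the words in every sentence that matches $pattern.
# Hypothetical helper: sentences are assumed to end at '.', '?' or '!'.
sub count_words_in_matching_sentences {
    my ($text, $pattern) = @_;
    my %count;
    for my $sentence (split /[.?!]+/, $text) {
        # Skip sentences that do not match the pattern
        next unless $sentence =~ $pattern;
        # Lower-case each run of letters so "Asia" and "asia" share a counter
        $count{ lc $_ }++ for $sentence =~ /[A-Za-z]+/g;
    }
    return \%count;
}

# Illustrative sample text only
my $text = <<'END_TEXT';
When must we come to asia to see the fox? Dolum ipsum dolor est.
I would guess that few people speak latin in Asia. But how would I know?
END_TEXT

my $count = count_words_in_matching_sentences($text, qr/asia/i);

# Most frequent words first, ties broken alphabetically
for my $word (sort { $count->{$b} <=> $count->{$a} || $a cmp $b } keys %$count) {
    printf "%5d  %s\n", $count->{$word}, $word;
}
```

      Passing the pattern in as qr/asia/i keeps the same loose match as the script above (it also catches "asiatic" and "asians"); something like qr/\basia\b/i would restrict it to the whole word.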
      ...roboticus

      And here is some code for getting the text out of a PDF, using an excellent little CPAN module called CAM::PDF. (If you don't know how to install CPAN modules, just ask.)

      This goes through a PDF page-by-page, grabbing the text, and then saves it all to a text file. Note that if your PDF is huge you may want to modify this to do it in chunks (the 367 page PDF I tested it on only took a few seconds, though).

      #!/usr/bin/perl
      use warnings;
      use strict;
      use CAM::PDF;

      my $pdf_path = $ARGV[0] or die "No pdf specified";
      my $pdf = CAM::PDF->new($pdf_path)
          or die "Could not open PDF '$pdf_path'";

      # Collect the text of every page
      my $text = '';
      for my $page (1..$pdf->numPages) {
          $text .= $pdf->getPageText($page);
      }

      # Save everything to a text file
      open my $file, '>', 'pdftext.txt'
          or die "Cannot open pdftext.txt: $!";
      print $file $text;
      close $file;
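
      For the huge-PDF case mentioned above, one possible modification is to write each page out as soon as it is extracted, so only one page of text is held in memory at a time. This is a rough sketch, not tested against a large PDF: write_pages is a hypothetical helper (not part of CAM::PDF) that takes a code reference for fetching a page, which also lets it be tried without a PDF at hand.

```perl
use strict;
use warnings;

# Write pages to $out_fh one at a time instead of building one big string.
# $get_page is a code reference that returns the text of page $n.
# Hypothetical helper for illustration; it is not part of CAM::PDF.
sub write_pages {
    my ($out_fh, $num_pages, $get_page) = @_;
    my $written = 0;
    for my $page (1 .. $num_pages) {
        my $text = $get_page->($page);
        next unless defined $text;    # skip pages where the getter returns undef
        print {$out_fh} $text;
        $written++;
    }
    return $written;                  # number of pages actually written
}

# Intended use with CAM::PDF (needs the module installed):
#
#   use CAM::PDF;
#   my $pdf = CAM::PDF->new($ARGV[0]) or die "Could not open PDF";
#   open my $out, '>', 'pdftext.txt' or die "Cannot open pdftext.txt: $!";
#   write_pages($out, $pdf->numPages, sub { $pdf->getPageText($_[0]) });
#   close $out or die "Cannot close pdftext.txt: $!";
```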



        Thanks to all of you for your help. I appreciate it. Two quick comments-- (i) Regarding the abbreviation problem, a quick manual scan indicates that all Asian sentences are (thankfully) bound by a period. (ii) How do you, in fact, install CPAN modules? I tried once, but got some error message along the lines of "file not found". Thanks again.