PerlMonks  

Re^3: PDF::OCR2 results not what I was hoping for

by Corion (Patriarch)
on Feb 08, 2016 at 17:02 UTC


in reply to Re^2: PDF::OCR2 results not what I was hoping for
in thread PDF::OCR2 results not what I was hoping for

Reading the documentation of PDF::OCR2, I get the impression that it converts the PDF pages into separate image files using PDF::GetImages and then uses Image::OCR::Tesseract to get the text from the image.

I would change that to add a cropping step in between, which selects only the "interesting" part of the image.

Re^4: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 18:35 UTC

    Bam! Got it. I set the "density" setting to "300x300" before reading the image in; by default it is 72 dpi.

    PDF::OCR2 is now reading the text in the cropped rectangle flawlessly.

    Thanks for pointing me in the right direction.

    Here is the sample code:

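    use Image::Magick;
    use PDF::OCR2;

    my $image = Image::Magick->new;
    # Rasterize at 300 dpi instead of the 72 dpi default; set this before Read.
    $image->Set(density => '300x300');
    $image->Read('agendas/2016-02-02 Natural Resources.pdf', compression => 'None');
    # Keep only the region of interest (geometry is WIDTHxHEIGHT+X+Y).
    $image->Crop(geometry => '1248x520+936+520');
    $image->Write(filename => 'crop.pdf', compression => 'None');

    # OCR the cropped file and print the recognized text.
    my $p = PDF::OCR2->new('crop.pdf');
    my $text_all = $p->text;
    print $text_all;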
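A side note on the crop geometry above: as far as I understand it, the numbers are pixel values on the rasterized page, so they depend on the density the PDF was read at. If the density changes, the geometry has to be rescaled too. Here is a rough sketch of that arithmetic. The helper `crop_geometry` and the region measurements are made up for illustration; they are not part of Image::Magick or PDF::OCR2.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Image::Magick or PDF::OCR2):
# convert a rectangle measured in inches on the page into a
# "WxH+X+Y" geometry string in pixels for a given density (dpi).
sub crop_geometry {
    my (%arg) = @_;
    my $dpi = $arg{dpi};
    # Scale each measurement by the density and round to the nearest pixel.
    my ($w, $h, $x, $y) =
        map { int($arg{$_} * $dpi + 0.5) } qw(width height left top);
    return sprintf '%dx%d+%d+%d', $w, $h, $x, $y;
}

# Illustrative region only: 4 in wide, 2 in tall,
# 3 in from the left edge, 1.5 in from the top.
my %region = (width => 4, height => 2, left => 3, top => 1.5);

print crop_geometry(%region, dpi => 72),  "\n";   # 288x144+216+108
print crop_geometry(%region, dpi => 300), "\n";   # 1200x600+900+450
```

The resulting string could then be passed as the `geometry` argument to `Crop`, assuming the same density was set before `Read`.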

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon";
    $nysus = $PM . $MCF;

Re^4: PDF::OCR2 results not what I was hoping for
by nysus (Parson) on Feb 08, 2016 at 18:17 UTC
    Thanks, yeah, I'm getting very close now. I'm at least getting some usable output after using Image::Magick to crop the PDF. The only problem is that ImageMagick seems to read the image in at very low quality, so the OCR results are suboptimal. Hopefully there is some setting I can use to address this.

