PDF::OCR2 results not what I was hoping for

by nysus (Parson)
on Feb 08, 2016 at 15:33 UTC

nysus has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to OCR this document.

The results are disappointing, to say the least. The output was basically random characters with lots of blank lines; none of the text in the original PDF was recognized. When I tried a "clean" document converted straight to PDF from a word processing document, the module worked fine. So apparently PDF::OCR2 just doesn't have the logic to pull text from more complex scanned documents.

I know that PDF::OCR2 just provides an interface to tesseract/imagemagick, so this probably isn't the best forum for this question, but I'm hoping someone can give me some advice that points me in the right direction.

I'm interested in pulling out the time and location data from the document.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon";
$nysus = $PM . $MCF;

Re: PDF::OCR2 results not what I was hoping for
by Corion (Patriarch) on Feb 08, 2016 at 15:45 UTC

    If you're trying OCR on a form, I think the best approach is to pre-segment the different areas where text appears. I found multi-column (or in your case, even multi-box) text to be highly confusing for the OCR programs I tried.

    As what you have is basically a form with more or less fixed offsets, I would try to extract the rectangles in which the date/time/location appear and then run OCR on just those regions. Also check the settings of your OCR program to see whether you can specify a sans-serif font.
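A minimal sketch of the "extract the rectangle" step, assuming the page has already been rendered to an image file and that PerlMagick (Image::Magick) is installed. The box is given as fractions of the page so it survives a change in scan resolution. The names crop_geometry and crop_region and the box values are hypothetical; the real offsets have to be measured on the actual form.

```perl
use strict;
use warnings;

# Turn a box given as fractions of the page (0..1) into an ImageMagick
# geometry string "WxH+X+Y" for an image of the given pixel size.
sub crop_geometry {
    my ($img_w, $img_h, %box) = @_;
    my ($x, $y, $w, $h) = map { int($_ + 0.5) } (
        $img_w * $box{left},  $img_h * $box{top},
        $img_w * $box{width}, $img_h * $box{height},
    );
    return sprintf '%dx%d+%d+%d', $w, $h, $x, $y;
}

# Crop one region out of a page image and write it to $out_file.
sub crop_region {
    my ($in_file, $out_file, %box) = @_;
    require Image::Magick;    # loaded lazily; crop_geometry works without it

    my $img = Image::Magick->new;
    my $err = $img->Read($in_file);
    die "Cannot read $in_file: $err" if "$err";

    my ($w, $h) = $img->Get('columns', 'rows');
    $err = $img->Crop(geometry => crop_geometry($w, $h, %box));
    die "Cannot crop $in_file: $err" if "$err";

    $img->Set(page => '0x0+0+0');    # drop the canvas offset left by Crop
    $err = $img->Write($out_file);
    die "Cannot write $out_file: $err" if "$err";
    return $out_file;
}

# Example run: perl crop.pl page.png region.png
# The box below is a made-up placeholder, not measured from any real form.
if (@ARGV == 2) {
    crop_region(@ARGV, left => 0.5, top => 0.25, width => 0.25, height => 0.5);
}
```

The cropped file can then be handed to whatever OCR step is already in use, one file per box on the form.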

      I didn't see anything in the PDF::OCR2 documentation that lets you scan just a portion of the document. How would I do this?


        Reading the documentation of PDF::OCR2, I get the impression that it converts the PDF pages into separate image files using PDF::GetImages and then uses Image::OCR::Tesseract to get the text from the image.

        I would change that to add a cropping step in between, which selects only the "interesting" part of the image.
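A rough sketch of where such a cropping step could slot in. It assumes, from my reading of their documentation, that PDF::GetImages provides pdfimages() (absolute PDF path in, array ref of image files out) and that Image::OCR::Tesseract provides get_ocr() (image path in, text out). The function ocr_cropped_pdf and its extract/crop/ocr hooks are hypothetical names for illustration, not part of PDF::OCR2. The crop step is passed in as a code reference because how to crop depends on the form.

```perl
use strict;
use warnings;
use File::Spec;

# Extract page images from a PDF, crop each one, then OCR the cropped file.
# %step lets the caller replace any stage; the defaults load the modules
# that PDF::OCR2 is described as using.
sub ocr_cropped_pdf {
    my ($pdf, %step) = @_;

    my $extract = $step{extract} || sub {
        require PDF::GetImages;
        # Assumption: pdfimages() returns an array ref of extracted image paths.
        return @{ PDF::GetImages::pdfimages(File::Spec->rel2abs($_[0])) };
    };
    my $crop = $step{crop} or die "a crop step is required";
    my $ocr = $step{ocr} || sub {
        require Image::OCR::Tesseract;
        # Assumption: get_ocr() returns the recognized text for one image.
        return Image::OCR::Tesseract::get_ocr($_[0]);
    };

    my @texts;
    for my $page_image ($extract->($pdf)) {
        my $region_image = $crop->($page_image);    # the new in-between step
        push @texts, scalar $ocr->($region_image);
    }
    return @texts;
}
```

The crop hook receives the path of a page image and must return the path of the cropped image, so it can wrap ImageMagick or any other tool that can cut a rectangle out of an image.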

      Maybe I could use ImageMagick to crop the PDF?

