
Hi Umesh,

Yes, that's the same point that I got to.

In practice, you end up with a lot of text fragments that need to be reassembled into words and lines. That is a fair bit of work and usually involves some heuristics, such as grouping fragments by baseline and deciding from the horizontal gaps where one word ends and the next begins.
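To give a rough idea of the kind of heuristic involved, here is a minimal sketch in Python. It is not what CAM::PDF, pstotext or pdfminer actually do; the `(x, y, width, text)` fragment layout, the function name `assemble_lines` and both tolerance values are assumptions made up for illustration, and real PDFs (rotated text, multiple columns, varying font sizes) need more care.

```python
from typing import List, Tuple

# Hypothetical fragment layout: (x, y, width, text) in PDF points.
Fragment = Tuple[float, float, float, str]


def assemble_lines(fragments: List[Fragment],
                   y_tol: float = 2.0,
                   gap_tol: float = 1.5) -> List[str]:
    """Group fragments into lines by baseline, then join them left to right.

    y_tol   -- assumed max baseline difference for fragments on the same line
    gap_tol -- assumed min horizontal gap that counts as a word break
    """
    # PDF y coordinates grow upward, so sort by descending y to read top-down.
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))

    lines = []  # each entry: [baseline_y, [fragments on that line]]
    for frag in ordered:
        if lines and abs(lines[-1][0] - frag[1]) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag[1], [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f[0])  # left to right within the line
        text = frags[0][3]
        prev_end = frags[0][0] + frags[0][2]
        for x, _, width, s in frags[1:]:
            # Insert a space only when the gap is wide enough to be a word break.
            if x - prev_end > gap_tol:
                text += " "
            text += s
            prev_end = x + width
        result.append(text)
    return result


sample = [
    (10.0, 700.0, 12.0, "He"),
    (22.0, 700.5, 15.0, "llo"),
    (42.0, 700.0, 30.0, "world"),
    (10.0, 680.0, 20.0, "next"),
]
print(assemble_lines(sample))
```

The fixed tolerances are the weak point: in practice they would have to scale with the font size, which is part of why this gets fiddly.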

Rather than continuing to develop the above, I personally went with pstotext, which runs on top of Ghostscript; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name, it will work on PDF files.

Another program I looked at was pdfminer.

One of these, or something similar, might work. It's just a matter of how good a job they do.

- David

Re^4: CAM::PDF extract text and their coordinates from pdf..
by umesh_epub (Novice) on Jan 10, 2013 at 13:04 UTC

    Thanks David

    I will look at pdfminer and pstotext.

    I searched for pstotext in my Ghostscript installation ("GPL Ghostscript 8.70 (2009-07-31)"), but that command is not available.

    In which version of GS is pstotext available?

    Thanks,
    Umesh

      Hi Umesh,

      It uses Ghostscript, but needs to be installed as a separate package. I'm running Debian, which had the `pstotext` package readily available.

      But the source seems to be getting harder to find. Slackware has an archive.