
Re^2: CAM::PDF extract text and their coordinates from pdf..

by umesh_epub (Novice)
on Jan 10, 2013 at 05:39 UTC

in reply to Re: CAM::PDF extract text and their coordinates from pdf..
in thread CAM::PDF extract text and their coordinates from pdf..

Hi Snoopy,
Thanks for your kind reply. How can I tell where a line starts and ends?
Which material should we study to learn how to work with PDFs?

Re^3: CAM::PDF extract text and their coordinates from pdf..
by snoopy (Deacon) on Jan 10, 2013 at 05:58 UTC
    Hi Umesh,

    Yes, that's the same point that I got to.

    In practice, you end up with a lot of text fragments that need to be reassembled into words and lines. That is a fair bit of work and can involve some heuristics, such as grouping fragments by vertical position and deciding from the horizontal gaps where the word breaks fall.
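
To illustrate the kind of heuristic involved, here is a minimal sketch in Python. It is not taken from CAM::PDF, pstotext, or pdfminer; the `Fragment` type, the `assemble_lines` function, and the two tolerance values are hypothetical, and it assumes each fragment already carries a position and width in PDF points. Fragments whose baselines are close together are treated as one line, and a space is inserted wherever the horizontal gap between neighbours exceeds a threshold.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its position in PDF points."""
    x: float      # left edge
    y: float      # baseline
    width: float  # horizontal extent
    text: str


def assemble_lines(fragments, y_tol=2.0, gap_tol=1.0):
    """Group fragments into lines by baseline, then join them left to right.

    y_tol and gap_tol are assumed tolerances; real documents usually need
    values scaled to the font size.
    """
    lines = []  # each entry: [baseline of first fragment, [fragments]]
    # PDF y coordinates grow upward, so descending y reads top to bottom.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0] - frag.y) <= y_tol:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            # A visible gap after the previous fragment marks a word break.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > gap_tol else "") + cur.text
        result.append(text)
    return result


sample = [
    Fragment(72, 700, 20, "Hel"),
    Fragment(92, 700.5, 10, "lo"),
    Fragment(108, 700, 30, "world"),
    Fragment(72, 686, 30, "second"),
]
print(assemble_lines(sample))  # ['Hello world', 'second']
```

Fixed tolerances like these break down with mixed font sizes, rotated text, or multi-column layouts, which is part of why a tool that already handles those cases can be the easier route.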

    Rather than continuing to develop the above, I personally went with pstotext, which works with the Ghostscript suite; it has a `-bboxes` option to output text positions and does attempt to assemble words and lines. Despite its name, it will work on PDF files.

    Another program I looked at was pdfminer.

    One of these, or something similar, might work. It's just a matter of how good a job they do.

    - David

      Thanks David

      I will look at pdfminer and pstotext.

      I searched for pstotext in my Ghostscript installation ("GPL Ghostscript 8.70 (2009-07-31)"), but that command is not available.

      Which version of Ghostscript includes pstotext?


        Hi Umesh,

        It uses Ghostscript, but needs to be installed as a separate package. I'm running Debian, which had the `pstotext` package readily available.

        But the source seems to be getting harder to find. Slackware has an archive.
