PerlMonks  

pdf2txt?

by tbone1 (Monsignor)
on Oct 09, 2003 at 18:12 UTC (#298041=perlquestion)

tbone1 has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks -

I've been handed a project which requires that text stored in a .pdf file be extracted and then folded, spindled, mutilated, etc. I know of the PDF modules and pdf2html, but they aren't quite what I need. I've searched and Super Searched here and pored over freshmeat.net, but I cannot find such a beast. I know I can't be the first person to want to reduce the data in a .pdf file to its ASCII equivalent. I'm under a bit of a crunch on this, so I don't think that I have time to write the converter myself.

Does anyone know of such a converter that already exists? And if so, where is it?

--
tbone1
Ain't enough 'O's in 'stoopid' to describe that guy.
- Dave "the King" Wilson

Replies are listed 'Best First'.
Re: pdf2txt?
by duct_tape (Hermit) on Oct 09, 2003 at 18:20 UTC

    xpdf comes with a utility called pdftotext. Perhaps that is what you are looking for? Also, searching Google for 'pdftotext', I found another application that appears to do the same thing on Windows.

    xpdf can be found at: http://www.foolabs.com/xpdf/

    Hope that helps.
    Brad

      It looks like the best solution; thanks. I was hoping for something Perlish that I could use deep in my Perl script, but sometimes you just have to play the hand you're dealt.
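
One way to use pdftotext from inside a script rather than by hand is to run it as a child process and capture what it writes. The following is a minimal sketch in Python, not a tested production wrapper. The helper names `pdftotext_command` and `pdf_to_text` are made up for illustration, and it assumes the `pdftotext` binary from xpdf is installed and on the PATH. Whether `-layout` actually helps with a given file is something to check against the real documents.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the argument list for pdftotext, sending the text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")      # ask pdftotext to keep the physical page layout
    cmd += ["-enc", "UTF-8"]       # request UTF-8 output so decoding is predictable
    cmd += [pdf_path, "-"]         # "-" as the output file means standard output
    return cmd


def pdf_to_text(pdf_path, keep_layout=True):
    """Run pdftotext on one file and return the extracted text as a string."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,                # raise CalledProcessError on a non-zero exit
    )
    # Replace any undecodable bytes instead of failing on a stray character.
    return result.stdout.decode("utf-8", errors="replace")
```

With a hypothetical file name, `text = pdf_to_text("report.pdf")` would hand the extracted text to the rest of the script for further processing.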

      --
      tbone1
      Ain't enough 'O's in 'stoopid' to describe that guy.
      - Dave "the King" Wilson

Re: pdf2txt?
by cbraga (Pilgrim) on Oct 09, 2003 at 19:08 UTC
    While it may not be the perfect solution, you can always convert the PDF to HTML with the program you mentioned and then convert the HTML to text.
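
The second half of that route, HTML to text, can be sketched with nothing but a standard-library parser. This is a rough illustration in Python, assuming the HTML is already in a string; the class name `TextExtractor`, the function `html_to_text`, and the choice of which tags start a new line are all assumptions made for the example. It makes no attempt to lay out tables, so each table row simply becomes a line of text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, dropping the markup."""

    # Tags treated as starting a new line (an arbitrary choice for this sketch).
    BLOCK_TAGS = {"p", "br", "div", "tr", "li", "h1", "h2", "h3", "table"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # depth inside <script>/<style>, whose content is ignored

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block element per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)
```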

      I'd tried that, but some of the .pdf files have ugly tables in them, and the tables created by pdftohtml are, um, unpretty. In fact, predicting the output and formats was a royal mess.

      --
      tbone1
      Ain't enough 'O's in 'stoopid' to describe that guy.
      - Dave "the King" Wilson

Re: pdf2txt?
by freddo411 (Chaplain) on Oct 09, 2003 at 22:14 UTC
    You have a very difficult job in front of you. PDF isn't a format that translates back nicely into ASCII.

    I know for certain that when a long paragraph is visually wrapped into several lines in a PDF, the text of that paragraph is stored as several separate strings (roughly one per line). This causes problems when you want to write sensible plain ASCII back out.
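
To illustrate the wrapping problem, here is a small Python sketch that glues hard-wrapped lines back into paragraphs. It assumes the extracted text separates paragraphs with blank lines, which real PDF output does not always do, and the hyphen rule is only a heuristic. The function name `reflow` is made up for the example.

```python
import re


def reflow(text):
    """Join hard-wrapped lines into paragraphs; blank lines separate paragraphs."""
    paragraphs = []
    for block in re.split(r"\n\s*\n", text.strip()):
        joined = ""
        for line in (l.strip() for l in block.splitlines()):
            if not line:
                continue
            if joined.endswith("-") and line[:1].islower():
                # Heuristic: a trailing hyphen followed by a lowercase letter
                # is treated as a word split across two lines.
                joined = joined[:-1] + line
            elif joined:
                joined += " " + line
            else:
                joined = line
        paragraphs.append(joined)
    return "\n\n".join(paragraphs)
```

The hyphen heuristic will wrongly merge genuinely hyphenated words that happen to break at the hyphen, so it is a starting point rather than a fix.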

    There are other issues as well, having to do primarily with getting the text in the correct order in the ASCII file.
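
The ordering problem can be pictured with a toy model: suppose each piece of text comes with the page coordinates where it is drawn, and the pieces arrive in arbitrary order. The Python sketch below sorts them into reading order for a simple single-column page. The `Fragment` type, the `reading_order` function, and the tolerance value are hypothetical, and it assumes PDF-style coordinates where y grows upward from the bottom of the page. Multi-column pages and tables would need more than this.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    x: float   # horizontal position of the fragment's start
    y: float   # baseline height; larger values are higher on the page
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, top to bottom, left to right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)       # close enough to the current line
        else:
            lines.append([frag.y, [frag]])  # start a new line
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]
```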

    Unless you are "cherry picking" a string or two, you'll be happier if you can redefine your problem in another way....

    Cheers

    -------------------------------------
    Nothing is too wonderful to be true
    -- Michael Faraday
