Extracting content text from PDFs
by pat_mc (Pilgrim) on Apr 01, 2008 at 15:36 UTC
pat_mc has asked for the wisdom of the Perl Monks concerning the following question:
Hi All -
I am trying to extract content such as the document title or the body text from PDF files, ultimately so that I can search or categorise my collection of PDFs. So far, I have tried parsing the raw PDF source with regular expressions. PDF section titles often carry the tag /Title, but not always, so this is not a reliable basis for parsing.
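To make the regex attempt concrete, here is a minimal sketch of that kind of scan. The helper name titles_from_raw_pdf and the sample data are made up for illustration and are not taken from any module. It only catches uncompressed literal strings of the form /Title (...), which is probably part of why the approach is unreliable: PDF strings may also be written in hex form, may contain escaped parentheses, or may sit inside compressed streams.

```perl
use strict;
use warnings;

# Hypothetical helper (illustration only): scan raw PDF bytes for
# literal-string /Title entries.
# Limitations: skips hex strings such as /Title <FEFF...>, strings with
# nested or escaped parentheses, and anything inside compressed streams.
sub titles_from_raw_pdf {
    my ($raw) = @_;
    my @titles;
    while ($raw =~ m{/Title\s*\(([^()\\]*)\)}g) {
        push @titles, $1;
    }
    return @titles;
}

# Made-up sample resembling two uncompressed PDF objects:
# one literal-string title and one hex-string title.
my $sample = "1 0 obj\n<< /Title (Chapter One) /Author (A. N. Other) >>\nendobj\n"
           . "2 0 obj\n<< /Title <FEFF00540077006F> >>\nendobj\n";

# Only the literal-string title is found; the hex-string one is missed.
print "$_\n" for titles_from_raw_pdf($sample);
```

For a real file, the raw bytes would be read in binary mode (binmode on the filehandle) before being passed to the helper.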
Do you know of any reliable Perl approaches (e.g. suitable modules) for handling PDFs?
Thanks in advance for your help!