PerlMonks  

Read PDF files & do regex through Perl.

by an_ordinary_man (Initiate)
on Feb 11, 2002 at 23:03 UTC ( [id://144729]=perlquestion )

an_ordinary_man has asked for the wisdom of the Perl Monks concerning the following question:

Hi All.
I want to extract information given in annotations (comments) inside PDF files, using regex.
When I open the PDF file in EditPlus it shows me a large file with about 10,000 lines containing the data I need & also lots of text that looks like junk, but when I read it through Perl it just reads about 29 lines & lots of the junk from PDF is missing.
I tried the texttopdf modules available, but they do not give me the contents of the annotations.
open (PDFTYPELOG, "+>pdftype.txt") or die "cannot open file errorlog.txt for writing : $!";
open (FILE_HANDLE, "D:\\MyScripts\\Social008-017.pdf") or die "cannot Social008-017.pdf for reading : $!";
while ($LineContent = <FILE_HANDLE>) {
    print PDFTYPELOG ($LineContent);
}
Please tell me how I can read the whole file line by line (if possible).
Regards.
An_Ordinary_Man

Replies are listed 'Best First'.
Re: Read PDF files & do regex through Perl.
by rjray (Chaplain) on Feb 11, 2002 at 23:47 UTC

    A PDF file is not a plain text file. It is a fairly complex binary format, so reading it with normal line-oriented I/O will not work.

    Look into the PDF-oriented modules on CPAN (http://search.cpan.org/search?mode=module&query=PDF), or look for PDF tools on Freshmeat.net. You could use one of these to pre-process the PDF and extract the parts you want, which your Perl script can then handle.

    --rjray

Re: Read PDF files & do regex through Perl.
by beebware (Pilgrim) on Feb 11, 2002 at 23:48 UTC

    You may find the official Adobe PDF Reference books handy - they are available for free download (weighing in at 9 MB) from Adobe's website. Sanface have an early development version of a PDF-lib PDF comment extractor which may help guide you in the right direction.

    You may also find it useful to see how the code in the programs/scripts referenced from this article works, and see if you can 'tweak' it to extract the data instead of placing it in.

    I personally have had experience of a commercial ($40k) package which converted the PDF to raw XML data, but IIRC - even that didn't cope with the annotations in the file. I think the aforementioned PDF specification is your best bet.

Re: Read PDF files & do regex through Perl.
by YuckFoo (Abbot) on Feb 12, 2002 at 02:53 UTC
    When reading binary files on Windows, you need to set 'binmode' immediately after opening the file. You are reading (and processing) a false 'end-of-file' after 29 lines.

    For more information, see 'perldoc -f binmode'.
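A minimal sketch of that advice, assuming you simply want the raw bytes of the file in one scalar. The helper name `read_raw` is hypothetical and not from the thread. On Windows, a filehandle in text mode translates CRLF pairs and can treat a Ctrl-Z byte (0x1A) as end-of-file, which is the most likely reason the read loop in the question stops early; `binmode` turns that off.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return a file's contents as raw bytes.
sub read_raw {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "cannot open $path for reading: $!";
    binmode($fh);    # no CRLF translation, no Ctrl-Z treated as end-of-file
    local $/;        # slurp mode: one read returns everything up to the real EOF
    my $bytes = <$fh>;
    close($fh);
    return $bytes;
}

# Usage with the path from the question (assumed to exist on your machine):
# my $pdf = read_raw('D:\\MyScripts\\Social008-017.pdf');
# printf "read %d bytes\n", length $pdf;
```

If you copy the data to another file, call `binmode` on the output handle as well, otherwise the bytes may be altered again on the way out.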

    Then you can figure out how to process the binary data.
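As a rough illustration of what that processing might look like for annotations: in the PDF format an annotation is a dictionary whose comment text is stored under the /Contents key. The sketch below assumes the raw bytes are already in a scalar, that the annotation objects are not compressed, and that the text is a literal string in parentheses with no unescaped nested parentheses. The helper name `annotation_contents` is hypothetical. Hex strings, UTF-16 text and escapes such as `\n` or octal codes are not handled, so treat this as a starting point rather than a PDF parser.

```perl
use strict;
use warnings;

# Hypothetical helper: collect /Contents (...) literal strings from raw PDF bytes.
# Assumes uncompressed objects and no unescaped nested parentheses in the text.
sub annotation_contents {
    my ($bytes) = @_;
    my @found;
    # Inside the parentheses, accept either a backslash escape or any
    # character that is not a backslash or a parenthesis.
    while ($bytes =~ m{/Contents\s*\(((?:\\.|[^\\()])*)\)}gs) {
        my $text = $1;
        $text =~ s/\\([()\\])/$1/g;    # undo the \( \) and \\ escapes only
        push @found, $text;
    }
    return @found;
}

# Usage, with $pdf holding the file's raw bytes:
# print "$_\n" for annotation_contents($pdf);
```

Page dictionaries also have a /Contents key, but there it is followed by an object reference such as `5 0 R` rather than a parenthesised string, so the pattern skips those.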

    YuckFoo

Re: Read PDF files & do regex through Perl.
by aj (Initiate) on Feb 12, 2002 at 10:24 UTC
    Here is a place to get PDF files converted to HTML and/or text. Just attach the required files and email them to all 3 places. You will get back a .txt file and an .html file. Please note: "locked" files cannot be opened unless you use a "special" program. Write me at moic@mail.com and I can give you the details.

    Site address: http://24.182.240.51/pdf/HiPdProf.pdf
    pdf2html@adobe.com: converts to HTML by email
    pdf2txt@adobe.com: converts to .txt by email
