PerlMonks  

How to extract image captions from a PDF file using perl

by Anonymous Monk
on Nov 16, 2010 at 17:48 UTC (#871788)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends, I have a set of PDF articles and I want to retrieve each image and the caption below it. How can I do this? Please give me suggestions.

Re: How to extract image captions from a PDF file using perl
by MidLifeXis (Prior) on Nov 16, 2010 at 18:17 UTC

    PDF modules on CPAN would probably be a good start. CAM::PDF, iirc, can do that (well, the image part - the caption is iffy). Also see HTML::HTMLDoc. (what was I yammering here?)
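
    As a rough sketch of the "iffy" caption part: CAM::PDF can return the text of each page, and lines that start with "Figure N" or "Fig. N" can then be picked out. The caption_lines helper and its regex are hypothetical, and this assumes the extracted text keeps captions on lines of their own, which will not hold for every PDF.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of a page's text that look like
# figure captions, e.g. "Figure 3: ..." or "Fig. 3 ...".
sub caption_lines {
    my ($text) = @_;
    return grep { /^\s*Fig(?:ure|\.)?\s*\d+/i } split /\n/, $text;
}

# Run as: perl captions.pl article.pdf
if (@ARGV) {
    require CAM::PDF;    # loaded only when a PDF file is actually given
    my $pdf = CAM::PDF->new( $ARGV[0] )
        or die "Cannot open $ARGV[0]\n";

    for my $page ( 1 .. $pdf->numPages ) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;
        print "page $page: $_\n" for caption_lines($text);
    }
}
```

    This only finds caption text; it does not tie a caption to a particular image.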

    --MidLifeXis

Re: How to extract image captions from a PDF file using perl
by LanX (Canon) on Nov 16, 2010 at 18:56 UTC
    Normally captions have a separate font setting, which should help identify them, especially when they are located near an image.

    See "Parsing PDFs by text position?" and included links for a start. HTH!
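
    A minimal sketch of that heuristic, assuming you have already obtained the image box and the text fragments of a page with their positions and font sizes (the data layout, find_caption, body_size and max_gap are all made up for illustration). It picks the closest fragment just below the image whose font size differs from the body text.

```perl
use strict;
use warnings;

# Hypothetical input: one image box and the text fragments on the same page.
# Coordinates use a top-left origin, so a larger y is further down the page.
my %image = ( x => 100, y => 200, width => 300, height => 150 );
my @fragments = (
    { text => 'Body text above the figure.', x => 72,  y => 180, font_size => 11 },
    { text => 'Figure 1: Sample caption.',   x => 110, y => 365, font_size => 9  },
    { text => 'Body text continues here.',   x => 72,  y => 400, font_size => 11 },
);

# Return the fragment most likely to be the caption, or undef if none fits.
sub find_caption {
    my ( $image, $fragments, %opt ) = @_;
    my $body_size = $opt{body_size} // 11;    # assumed body font size
    my $max_gap   = $opt{max_gap}   // 40;    # assumed max distance below image

    my $bottom = $image->{y} + $image->{height};
    my $right  = $image->{x} + $image->{width};

    # Keep fragments that are not body text, start below the image within
    # max_gap, and start horizontally inside the image box; nearest one wins.
    my @candidates =
        sort { $a->{y} <=> $b->{y} }
        grep {
               $_->{font_size} != $body_size
            && $_->{y} >= $bottom
            && $_->{y} - $bottom <= $max_gap
            && $_->{x} >= $image->{x}
            && $_->{x} <= $right
        } @$fragments;

    return $candidates[0];
}

my $caption = find_caption( \%image, \@fragments );
print defined $caption ? "$caption->{text}\n" : "no caption found\n";
```

    The thresholds would need tuning per journal layout, and captions placed above or beside the image would need a different position test.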

    Cheers Rolf

Re: How to extract image captions from a PDF file using perl
by chrestomanci (Priest) on Nov 17, 2010 at 10:01 UTC

    Perhaps you could convert your PDF files to SVG using Inkscape, and then parse the resulting SVG using one of the standard XML processing libraries.

    Inkscape has a command line mode that can do almost anything that you can do with the GUI.

    inkscape -f Input_file.pdf -l Output_file.svg
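
    A sketch of the parsing step, assuming XML::LibXML as the XML library (any other would do) and an SVG file such as the Output_file.svg produced by the command above. It lists the image and text elements with their x/y attributes. The svg_items helper is hypothetical, and it ignores transform attributes and positions set on nested tspan elements, so coordinates may be missing or need further work on real converted files.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: collect <image> and <text> elements from a parsed
# SVG document. Returns two array refs: images and texts.
sub svg_items {
    my ($doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs( svg => 'http://www.w3.org/2000/svg' );

    my ( @images, @texts );
    for my $node ( $xpc->findnodes('//svg:image') ) {
        push @images, {
            x      => $node->getAttribute('x'),
            y      => $node->getAttribute('y'),
            width  => $node->getAttribute('width'),
            height => $node->getAttribute('height'),
        };
    }
    for my $node ( $xpc->findnodes('//svg:text') ) {
        # Collapse runs of whitespace in the element's text content.
        ( my $content = $node->textContent ) =~ s/\s+/ /g;
        push @texts, {
            x    => $node->getAttribute('x'),
            y    => $node->getAttribute('y'),
            text => $content,
        };
    }
    return ( \@images, \@texts );
}

# Run as: perl svg_items.pl Output_file.svg
if (@ARGV) {
    my $doc = XML::LibXML->load_xml( location => $ARGV[0] );
    my ( $images, $texts ) = svg_items($doc);
    printf "image at (%s, %s)\n", $_->{x} // '?', $_->{y} // '?'
        for @$images;
    printf "text  at (%s, %s): %s\n", $_->{x} // '?', $_->{y} // '?', $_->{text}
        for @$texts;
}
```

    With both lists in hand, a caption can be guessed by looking for the text element closest below each image.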
