PerlMonks  

How to extract image captions from a PDF file using perl

by Anonymous Monk
on Nov 16, 2010 at 17:48 UTC
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends, I have a set of PDF articles and I want to retrieve each image and the caption below it. How can I do this? Please give me suggestions.

Re: How to extract image captions from a PDF file using perl
by MidLifeXis (Monsignor) on Nov 16, 2010 at 18:17 UTC

    The PDF modules on CPAN would probably be a good start. CAM::PDF, IIRC, can do that (well, the image part; the caption is iffy). Also see HTML::HTMLDoc. (What was I yammering about here?)

    --MidLifeXis
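
For the "iffy" caption part, one rough approach is to pull the text of each page with CAM::PDF and keep the lines that look like captions. The sketch below assumes captions begin with "Figure N" or "Fig. N" and that the extracted text keeps each caption at the start of its own line; neither is guaranteed for a given PDF. The helper name caption_candidates is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the lines of a page's text that look like figure captions.
# Assumption: captions start with "Figure N" or "Fig. N"; adjust the
# pattern to match the articles at hand.
sub caption_candidates {
    my ($text) = @_;
    return grep { /^\s*(?:Figure|Fig\.?)\s*\d+/i } split /\r?\n/, $text;
}

# When a PDF file is named on the command line, print the caption
# candidates found on each of its pages.
if (@ARGV) {
    require CAM::PDF;
    my $pdf = CAM::PDF->new($ARGV[0])
        or die "Cannot open $ARGV[0]: $CAM::PDF::errstr\n";
    for my $page (1 .. $pdf->numPages()) {
        my $text = $pdf->getPageText($page);
        next unless defined $text;    # no extractable text on this page
        print "page $page: $_\n" for caption_candidates($text);
    }
}
```

Run it as `perl captions.pl article.pdf` (the script name is arbitrary). This only finds caption text; it does not tie a caption to a particular image.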

Re: How to extract image captions from a PDF file using perl
by LanX (Canon) on Nov 16, 2010 at 18:56 UTC
    Captions normally have their own font setting, which should help identify them, especially when they are located near an image.

    See "Parsing PDFs by text position?" and the links it includes for a start. HTH!

    Cheers Rolf
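
The font-and-position idea can be sketched in plain Perl once the text runs and image boxes have been extracted by some other means. The record layout here (x, y, w, h for an image; x, y, size, text for a run) is hypothetical, as are the function names and the default gap of 40 units. The sketch also assumes y grows downward and that a caption starts within the horizontal extent of its image.

```perl
use strict;
use warnings;
use List::Util qw(first);

# Most frequent font size among the text runs, taken here to be the body size.
sub body_font_size {
    my ($runs) = @_;
    my %count;
    $count{ $_->{size} }++ for @$runs;
    my ($most) = sort { $count{$b} <=> $count{$a} || $a <=> $b } keys %count;
    return $most;
}

# Guess the caption of $image from $runs (array ref of text runs).
# Candidates start within the image's horizontal extent and lie at most
# $max_gap units below its bottom edge. The nearest candidate whose font
# size differs from the body size wins; otherwise the nearest candidate.
sub guess_caption {
    my ($image, $runs, $max_gap) = @_;
    $max_gap //= 40;    # assumed default, tune per document
    my $body   = body_font_size($runs);
    my $bottom = $image->{y} + $image->{h};
    my $right  = $image->{x} + $image->{w};

    my @below = sort { $a->{y} <=> $b->{y} }
                grep { $_->{y} >= $bottom
                    && $_->{y} - $bottom <= $max_gap
                    && $_->{x} >= $image->{x}
                    && $_->{x} <  $right } @$runs;

    my $caption = first { $_->{size} != $body } @below;
    $caption //= $below[0];
    return defined $caption ? $caption->{text} : undef;
}

# Made-up sample data: one image and four text runs.
my $image = { x => 100, y => 100, w => 200, h => 150 };
my @runs  = (
    { x => 100, y => 60,  size => 10, text => 'Body text above' },
    { x => 110, y => 262, size => 8,  text => 'Figure 1: Sample caption' },
    { x => 100, y => 280, size => 10, text => 'Body text resumes' },
    { x => 100, y => 294, size => 10, text => 'More body text' },
);
print guess_caption($image, \@runs), "\n";    # prints "Figure 1: Sample caption"
```

Native PDF coordinates have y growing upward, so the comparisons would need flipping if the positions come straight from a PDF content stream.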

Re: How to extract image captions from a PDF file using perl
by chrestomanci (Priest) on Nov 17, 2010 at 10:01 UTC

    Perhaps you could convert your PDF files to SVG using Inkscape, and then parse the resulting SVG using one of the standard XML processing libraries.

    Inkscape has a command line mode that can do almost anything that you can do with the GUI.

    inkscape -f Input_file.pdf -l Output_file.svg
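
Parsing the resulting SVG might look roughly like the sketch below, which uses XML::LibXML to list the image elements and text elements with their positions. It is a simplification under stated assumptions: it reads only the x, y, width and height attributes and ignores any transform attributes, so the coordinates are comparable only if the file uses no transforms (or after they have been applied). The function names are invented for this example. The -f and -l options above are as given in the post; other Inkscape releases may spell them differently, so check `inkscape --help`.

```perl
use strict;
use warnings;

# First number of an attribute value such as "12.5" or "12.5 18.1 23.7"
# (position attributes may hold one value per glyph); 0 if absent.
sub first_number {
    my ($value) = @_;
    return 0 unless defined $value;
    my ($n) = split ' ', $value;
    return $n // 0;
}

# Collect image boxes and text runs from an SVG document.
# Arguments are passed through to XML::LibXML->load_xml, e.g.
# (location => 'Output_file.svg') or (string => $svg_source).
sub svg_images_and_text {
    my %source = @_;
    require XML::LibXML;
    my $doc = XML::LibXML->load_xml(%source);
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

    my @images = map { +{
        x => first_number($_->getAttribute('x')),
        y => first_number($_->getAttribute('y')),
        w => first_number($_->getAttribute('width')),
        h => first_number($_->getAttribute('height')),
    } } $xpc->findnodes('//svg:image');

    my @texts;
    for my $node ($xpc->findnodes('//svg:text')) {
        # Position comes from the text element or its first positioned tspan.
        my ($pos) = $xpc->findnodes('descendant-or-self::*[@x and @y]', $node);
        (my $text = $node->textContent) =~ s/\s+/ /g;
        $text =~ s/^ | $//g;
        next unless length $text;
        push @texts, {
            x    => first_number($pos ? $pos->getAttribute('x') : undef),
            y    => first_number($pos ? $pos->getAttribute('y') : undef),
            text => $text,
        };
    }
    return (\@images, \@texts);
}
```

A call such as `my ($images, $texts) = svg_images_and_text(location => 'Output_file.svg');` returns two array references that can then be matched up by position.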
