PerlMonks  

How to extract image captions from a PDF file using perl

by Anonymous Monk
on Nov 16, 2010 at 17:48 UTC (#871788)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi friends... I have a set of PDF articles, and I want to retrieve each image together with the caption below it. How can I do this? Please give me suggestions.

Replies are listed 'Best First'.
Re: How to extract image captions from a PDF file using perl
by MidLifeXis (Monsignor) on Nov 16, 2010 at 18:17 UTC

    PDF modules on CPAN would probably be a good start. CAM::PDF, iirc, can do that (well, the image part - the caption is iffy). Also see HTML::HTMLDoc. (what was I yammering here?)

    --MidLifeXis

Re: How to extract image captions from a PDF file using perl
by LanX (Bishop) on Nov 16, 2010 at 18:56 UTC
    Captions normally have their own font setting, which should help to identify them, especially when they are located close to an image.

    See "Parsing PDFs by text position?" and included links for a start. HTH!

    Cheers Rolf
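
    The font-plus-position idea above can be sketched in plain Perl. This is only a rough illustration, not anything taken from the linked thread: the `@images` and `@texts` records, the `find_caption` helper, the body font size of 11 and the 40-point gap limit are all hypothetical, and a real script would first have to fill those records from a PDF parser.

    ```perl
    use strict;
    use warnings;

    # Hypothetical page layout; in practice a PDF parser would supply this.
    # PDF coordinates: origin at the bottom-left, y grows upward, so a
    # caption "below" an image has a smaller y than the image's bottom edge.
    my @images = (
        { name => 'Im1', x => 72, y => 500, w => 200, h => 150 },
    );
    my @texts = (
        { text => 'Body paragraph ...',        x => 72, y => 700, font_size => 11 },
        { text => 'Figure 1: Sample caption.', x => 72, y => 485, font_size => 9  },
        { text => 'More body text ...',        x => 72, y => 440, font_size => 11 },
    );

    # Return the nearest text block under $image whose font size differs
    # from the body font, or undef if nothing lies within $max_gap points.
    sub find_caption {
        my ($image, $texts, $body_font_size, $max_gap) = @_;
        my ($best, $best_gap);
        for my $t (@$texts) {
            # Skip text set in (roughly) the body font size.
            next if abs($t->{font_size} - $body_font_size) < 0.5;
            # Distance from the image's bottom edge down to the text.
            my $gap = $image->{y} - $t->{y};
            next if $gap < 0 || $gap > $max_gap;
            # Skip text that starts to the right of the image.
            next if $t->{x} > $image->{x} + $image->{w};
            if (!defined $best_gap || $gap < $best_gap) {
                ($best, $best_gap) = ($t, $gap);
            }
        }
        return $best;
    }

    for my $img (@images) {
        my $caption = find_caption($img, \@texts, 11, 40);
        printf "%s: %s\n", $img->{name},
            defined $caption ? $caption->{text} : '(no caption found)';
    }
    ```

    The thresholds would need tuning per document set, since journals differ in how far below a figure the caption sits and in which font they use for it.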

Re: How to extract image captions from a PDF file using perl
by chrestomanci (Priest) on Nov 17, 2010 at 10:01 UTC

    Perhaps you could convert your PDF files to SVG using Inkscape, and then parse the resulting SVG with one of the standard XML processing libraries.

    Inkscape has a command line mode that can do almost anything that you can do with the GUI.

    inkscape -f Input_file.pdf -l Output_file.svg
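
    As a rough sketch of the parsing step, assuming the XML::LibXML module from CPAN is installed, something like the following lists the `image` and `text` elements of an SVG file. The inline SVG string is a made-up stand-in for Inkscape's output, and `list_images_and_text` is a hypothetical helper name. Real Inkscape output often carries `transform` attributes on parent groups, which this sketch ignores, so the raw x/y values may not be final page positions.

    ```perl
    use strict;
    use warnings;
    use XML::LibXML;
    use XML::LibXML::XPathContext;

    # Made-up stand-in for the converted file. For a real file, use
    # XML::LibXML->load_xml(location => 'Output_file.svg') instead.
    my $svg = <<'SVG';
    <svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
      <image x="72" y="100" width="200" height="150" xlink:href="data:image/png;base64,AAAA"/>
      <text x="72" y="265"><tspan>Figure 1: Sample caption.</tspan></text>
    </svg>
    SVG

    # Collect position data for every <image> and <text> element.
    # Note: SVG y grows downward, the opposite of PDF coordinates.
    sub list_images_and_text {
        my ($svg_string) = @_;
        my $doc = XML::LibXML->load_xml(string => $svg_string);
        my $xpc = XML::LibXML::XPathContext->new($doc);
        # SVG elements live in a namespace, so XPath needs a prefix for it.
        $xpc->registerNs(svg => 'http://www.w3.org/2000/svg');

        my @images = map { +{
            x      => $_->getAttribute('x'),
            y      => $_->getAttribute('y'),
            width  => $_->getAttribute('width'),
            height => $_->getAttribute('height'),
        } } $xpc->findnodes('//svg:image');

        my @texts = map { +{
            x    => $_->getAttribute('x'),
            y    => $_->getAttribute('y'),
            text => $_->textContent,
        } } $xpc->findnodes('//svg:text');

        return (\@images, \@texts);
    }

    my ($images, $texts) = list_images_and_text($svg);
    printf "image at (%s, %s)\n", $_->{x}, $_->{y} for @$images;
    printf "text  at (%s, %s): %s\n", $_->{x}, $_->{y}, $_->{text} for @$texts;
    ```

    From there, each image could be matched with the closest text element underneath it to guess its caption.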
