PDF GetInfo(

by axelrose (Scribe)
on Apr 26, 2002 at 18:48 UTC

axelrose has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

is anybody using the CPAN module PDF (the PDF-111 distribution)?
The module works but cannot be run with warnings and strict.

Ultimately, all I need is to get the title information out of a PDF document.
Are there alternatives for this job?

Just searching for a "/Title" string is not enough because embedded objects may carry their own title information.
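
To illustrate why a bare search for /Title can mislead, here is a minimal Perl sketch on a synthetic, heavily simplified fragment. It is not a valid PDF, and the object numbers and titles are made up for illustration. It contrasts the first /Title found anywhere in the data with the one inside the object that the trailer's /Info entry points to.

```perl
use strict;
use warnings;

# Synthetic stand-in for PDF content (assumption: invented for this example).
my $pdf = join "\n",
    '5 0 obj',
    '<< /Title (Embedded figure) >>',
    'endobj',
    '12 0 obj',
    '<< /Title (Annual Report) /Author (someone) >>',
    'endobj',
    'trailer',
    '<< /Size 13 /Root 1 0 R /Info 12 0 R >>';

# Naive approach: take the first /Title that appears anywhere.
my ($naive) = $pdf =~ /\/Title\s*\(([^)]*)\)/;

# Safer approach: read the object number from the trailer's /Info entry,
# then look for /Title only inside that object's dictionary.
my ($num)   = $pdf =~ /\/Info\s+(\d+)\s+0\s+R/;
my ($dict)  = $pdf =~ /^$num 0 obj\s*(<<.*?>>)/ms;
my ($title) = $dict =~ /\/Title\s*\(([^)]*)\)/;

print "naive: $naive\n";
print "via /Info: $title\n";
```

The naive match returns the embedded object's title, while the /Info lookup returns the document's own.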

Thanks for your time,

Replies are listed 'Best First'.
Re: PDF GetInfo(
by svad (Pilgrim) on Apr 27, 2002 at 06:12 UTC
    When I used it for my local needs, I reworked it a lot, because it is not only unsafe under warnings and strict, but it also does not always parse things correctly.
    I have not yet succeeded in polishing it to a 100% working state.

    Another solution: if your machine has Acrobat (Exchange), you can use the Win32::OLE module. Having only Acrobat Reader installed is not enough in this case :(

        use strict;
        use Win32::OLE;
        use Win32::OLE::Const;

        my $abat  = Win32::OLE->new('AcroExch.App');
        my $abdoc = Win32::OLE->new('AcroExch.AVDoc');
        $abdoc->Open('d:\Documentation\perl\xtk.pdf', 'd:\Documentation\perl\xtk.pdf');
        my $pddoc = $abdoc->GetPDDoc;
        print "pp=" . $pddoc->GetNumPages, "\n";
        for (qw(Title Subject Author)) {
            print "$_:" . $pddoc->GetInfo($_), "\n";
        }

    The Acrobat SDK documentation has information about how to automate it via OLE.

    Warmest wishes,

      The line use Win32::OLE::Const; is left over from my own code; it is not needed here.

      Thanks for your answer, Vadim!

      I own the full Acrobat program, but for speed and portability reasons I would prefer a pure Perl solution.

      pdflib is not under a GNU license, and the other CPAN modules are, as far as I've found, better suited to PDF creation. I'd therefore love to see the module corrected. Let's see what the author says.

      Best wishes,
Re: PDF GetInfo(
by Arguile (Hermit) on Aug 16, 2002 at 18:01 UTC

    Rather late on the reply, but I was just doing something similar. I needed to generate an alphabetical index of publications (all PDF) and wanted it in a nice HTML table with other pertinent metadata (pages, creation date, etc.).

    What follows is just a bit of cleaned up PDF reading code:

        use PDF;    # from the PDF-111 distribution on CPAN

        opendir DIR, '.';
        my @pdf = sort { $a->{Title} cmp $b->{Title} }
                  grep { defined }                  # drop files that were not PDFs
                  map  { scalar( get_info($_) ) }
                  grep { /\.pdf$/i && -f $_ }       # only regular files ending in .pdf
                  readdir DIR;
        closedir DIR;

        sub get_info {
            # Get basic PDF metadata.
            my $file = shift;
            my %info;

            my $pdf = PDF->new($file);
            return undef unless $pdf->IsaPDF;

            $info{Filename} = $file;
            $info{Size}     = -s $file;
            $info{Pages}    = $pdf->Pages;
            for (qw(Title CreationDate ModDate)) {
                $info{$_} = $pdf->GetInfo($_);
            }

            # Hash in list context, hash reference in scalar context.
            return( wantarray ? %info : \%info );
        }

    That opens the current directory and produces an array of PDF metadata hashes sorted alphabetically by title (to sort by different criteria, just change the sort {} block). $pdf[0]{Title} is the title of the first PDF in the array.

    print "$_->{Title}\n" for @pdf;

    That will give you a plain-text list of all the document titles. If you only want titles, just delete all the hash stuff in the sub and return the title scalar only.

Re: PDF GetInfo(
by axelrose (Scribe) on Apr 29, 2002 at 20:42 UTC
    With the help of Martin Hosken, author of the Text::PDF modules, I attacked the task like this:

        #!perl -w
        use strict;
        use Text::PDF::File;

        if (@ARGV) {
            for my $file (@ARGV) {
                if ( -r $file ) { print gettitle($file), "\n" }
            }
        }
        elsif ( $^O =~ /Mac/ ) {
            chomp( my $pwd = `pwd` );
            my $file = MacPerl::Ask( "Input file:", $pwd );
            if ( -r $file ) { print gettitle($file), "\n" }
        }
        else {
            die "no input, no output\n";
        }

        sub gettitle {
            my $pdffile = shift;
            my $pdf     = Text::PDF::File->open($pdffile) || die;
            my $info    = $pdf->{'Info'}->val;
            my $title   = $info->{'Title'}->val;
        }

    I will check if manually going through all lines of the PDF file will give a speed boost.
      With the help of Alan Fry I managed to get a fast solution like this:

          sub gettitle {
              use Fcntl;
              my $file = shift;
              local *IN;
              sysopen( IN, $file, O_RDONLY, 0 ) or die "while reading: '$file'\n";
              read IN, my ($str), -s $file;
              close IN;

              my ($info_block) = ( $str =~ /\/Info\s(\d+)\s0\sR/ )
                  or die "cannot get /Info paragraph\n";

              my $searchpos = -1;
              my $info_start;
              while (1) {
                  $info_start = index( $str, "$info_block 0 obj", $searchpos + 1 );
                  die "cannot get position of '$info_block 0 obj'\n"
                      if $info_start < $searchpos + 1;
                  last if ( substr( $str, $info_start - 1, 1 ) =~ /\015|\012/ );
                  $searchpos = $info_start;
              }

              my $info_obj = substr( $str, $info_start,
                  index( $str, ">>", $info_start ) - $info_start + 2 );

              my ($title) =
                  ( $info_obj =~ /\/Title\s*\( ([^\015\012|\015|\012]*) \) /x )
                  or return 'undefined';
              return $title;
          }

      I furthermore compared the performance of the above solution with Text::PDF and PDF-111 from CPAN. The test set consisted of 36 PDF files summing up to 3.8 MB.

      runtime ratios of
      index-solution-from-above : Text::PDF methods : PDF-111
      1 : 6 : 12
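
A comparison like the one above can be reproduced with Perl's core Benchmark module. The sketch below is only an outline: the two subs are hypothetical stand-ins that scan an in-memory string, not the actual extractors from this thread. In a real run each sub would call the corresponding title-extraction code on every file in the test set.

```perl
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);

# Assumption: an invented 10 KB sample with a /Title marker at the end.
my $sample = ( 'x' x 10_000 ) . '/Title (Example)';

# Hypothetical stand-in candidates; replace with the real extractors.
my %candidates = (
    index_scan => sub { index( $sample, '/Title' ) },
    regex_scan => sub { $sample =~ m{/Title} ? $-[0] : -1 },
);

# Run each candidate for at least one CPU second; 'none' suppresses
# the per-candidate report so only the comparison table is printed.
my $results = timethese( -1, \%candidates, 'none' );

# Print a table of rates and relative speed differences.
cmpthese($results);
```

Using a negative count lets Benchmark choose the iteration count, which avoids unreliable timings for very fast subs.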

      PDF-111 from CPAN has other flaws too. The author didn't respond to my questions. IMHO it should be dumped. It has a far too prominent place in the module hierarchy.

        I'll grant you PDF has some definite flaws, but unfortunately the above solution does as well.

        It has trouble with titles that were truncated due to length: it returns them as undefined. There also seem to be some problems with Asian languages.

        I like the speed and memory use compared to some of the others (about 4× faster than PDF->GetInfo). I need to muck about in the info section for some other stuff anyway, so hopefully I'll figure out the format for long titles. Thanks.
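
One possible cause of the long-title and Asian-language failures described above (an assumption; the thread does not confirm it): PDF literal strings may contain nested or escaped parentheses and backslash-newline line continuations, and non-Latin text is commonly stored as UTF-16BE with a leading FE FF byte-order mark. A single-line regex cannot cope with either. The following is a minimal sketch of a more tolerant reader. `extract_title` is a hypothetical helper, hexadecimal strings written as `<...>` are not handled, and PDFDocEncoding is approximated by Latin-1.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper (not from the thread): pull the /Title literal string
# out of an Info dictionary and decode it.
sub extract_title {
    my ($info_obj) = @_;

    # Find the opening parenthesis of the /Title value.
    return unless $info_obj =~ /\/Title\s*\(/g;

    # Collect everything up to the matching close parenthesis, tracking
    # nesting depth and stepping over backslash-escaped characters.
    my ( $depth, $raw ) = ( 1, '' );
    while ( $info_obj =~ /\G(\\.|[()]|[^\\()]+)/gcs ) {
        my $tok = $1;
        if    ( $tok eq '(' ) { $depth++ }
        elsif ( $tok eq ')' ) { last if --$depth == 0 }
        $raw .= $tok;
    }
    return if $depth;    # unbalanced string

    # Resolve escapes: octal codes, backslash-newline continuations,
    # and single-character escapes such as \n or an escaped parenthesis.
    my %esc = ( n => "\n", r => "\r", t => "\t", b => "\b", f => "\f" );
    $raw =~ s{\\(?:([0-7]{1,3})|(\r\n|\r|\n)|(.))}{
          defined $1      ? chr( oct($1) & 0xFF )
        : defined $2      ? ''
        : exists $esc{$3} ? $esc{$3}
        :                   $3
    }gse;

    # A leading FE FF byte-order mark signals UTF-16BE text.
    if ( $raw =~ s/\A\xFE\xFF// ) {
        return decode( 'UTF-16BE', $raw );
    }

    # Assumption: treat everything else as Latin-1.
    return decode( 'ISO-8859-1', $raw );
}

print extract_title("<< /Title (Report \\(draft\\)) >>"), "\n";
```

The function returns undef when no /Title entry is found, so callers can decide how to report a missing title instead of receiving the string 'undefined'.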