Detecting PDF content

by Rich36 (Chaplain)
I'm working on an application that detects file types and processes each file based on its type. I'm using File::MMagic, which returns the MIME type.

Some of the PDF documents I'm working with don't contain any text; they're just images inside a PDF document. Is there a module or some code that can detect that? I've looked through CPAN, but nothing jumped out at me as a solution. I was hoping something in the header would indicate it, but I'm not sure there is (the information captured by PDF::Parse didn't provide what I need).


Re: Detecting PDF content
by tall_man (Parson) on Jan 22, 2003 at 23:01 UTC
    You want to detect some of the low-level structure in a PDF file, right? Perhaps the script, which comes in the PDF CPAN package in the "examples" subdirectory, would be of some help.

    Here is the start of what it dumps about the first page of The Perl Journal:

    % perl 0301tpj.pdf 1
    Page 1
    Dictionary <<
        Name: /CropBox => Array [ Number: 0 Number: 0 Number: 558 Number: 756 ]
        Name: /MediaBox => Array [ Number: 0 Number: 0 Number: 558 Number: 756 ]
        Name: /Rotate => Number: 0
        Other: Page_Object => Object: 402 0 R
        Other: Resource_Object => Object: 434 0 R
    >>
    ...
    You can probably find a distinct set of components for your image-only cases.

    Update: Mr. Muskrat and I seem to have different interpretations of your question. I read "detect that" to mean "detect that a file (which is already known to be a pdf file) contains only images rather than images plus text or text alone."
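
    One rough way to look for a "distinct set of components" is to scan the raw bytes for resource names. The sketch below is a heuristic, not something from the PDF module: it assumes an image-only PDF declares image XObjects (/Subtype /Image) but no /Font resources, and that the dictionaries are stored uncompressed. The function names classify_pdf_bytes and classify_pdf_file are made up for illustration.

```perl
use strict;
use warnings;

# Heuristic only: classify raw PDF bytes by the resource names they mention.
# Assumption: dictionaries are stored uncompressed, so the names are visible
# as plain bytes. Files that pack objects into compressed streams will not match.
sub classify_pdf_bytes {
    my ($data) = @_;

    # Match /Font as a whole name (also catches font resource dictionaries),
    # but not longer names such as /FontDescriptor.
    my $has_font  = $data =~ m{/Font(?![A-Za-z])};
    my $has_image = $data =~ m{/Subtype\s*/Image(?![A-Za-z])};

    return 'text-likely'       if $has_font;
    return 'image-only-likely' if $has_image;
    return 'unknown';
}

# Hypothetical helper: slurp a file in binary mode and classify it.
sub classify_pdf_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;                  # slurp mode: read the whole file at once
    my $data = <$fh>;
    close $fh;
    return classify_pdf_bytes($data);
}

# Usage: perl classify.pl file1.pdf file2.pdf
print "$_: ", classify_pdf_file($_), "\n" for @ARGV;
```

    Treat the result as a hint. A file can declare a font and never draw text with it, and text drawn as vector outlines needs no font at all, so anything important should be confirmed by inspecting the page contents.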

      I think that will definitely be helpful. I'll just need to work out what the returned parameters are and what constitutes an image.

Re: Detecting PDF content
by Mr. Muskrat (Canon) on Jan 22, 2003 at 23:05 UTC
    How about this code that I pulled from the docs for PDF?
    use PDF;
    my $pdf = PDF->new($filename);
    print "$filename is a PDF file\n" if ($pdf->IsaPDF);

    Added: If a PDF contains no text, is it still a PDF? Of course.

    Updated again: A PDF starts out with %PDF if it helps any...

    D'oh! I totally misunderstood your question... I apologize.
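
    For what it's worth, the %PDF signature mentioned above can be checked without any module. This is a minimal sketch using only core Perl. The name looks_like_pdf is made up for illustration, and the check assumes the signature sits at the very first byte of the file. It only confirms the file type and says nothing about whether the pages contain text.

```perl
use strict;
use warnings;

# Return 1 if the file begins with the "%PDF-" signature, 0 otherwise.
# Assumption: the signature sits at byte offset 0 of the file.
sub looks_like_pdf {
    my ($path) = @_;
    open my $fh, '<:raw', $path or return 0;
    my $n = read $fh, my $head, 5;   # read the first five bytes
    close $fh;
    return defined $n && $n == 5 && $head eq '%PDF-' ? 1 : 0;
}

# Usage: perl ispdf.pl file1 file2 ...
for my $file (@ARGV) {
    print "$file ", (looks_like_pdf($file) ? "is" : "is not"), " a PDF file\n";
}
```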

