http://www.perlmonks.org?node_id=1103637

ateague has asked for the wisdom of the Perl Monks concerning the following question:

Good evening!

I am attempting to merge several thousand small (<20KiB) PDF files together into one PDF using PDF::API2.

The merging script works correctly at the beginning, but quickly bogs down after processing ~100 PDFs before grinding to a halt with the following warnings:

Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release" at C:/Perl64/site/lib/PDF/API2/Basic/PDF/Objind.pm line 123.
Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release" at C:/Perl64/site/lib/PDF/API2/Basic/PDF/Objind.pm line 123.
Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release" at C:/Perl64/site/lib/PDF/API2/Basic/PDF/Objind.pm line 123.
...

Here is a small sample merging script that exhibits the problem
(the script used to generate the sample PDF files can be found in the READMORE section):

#!/usr/bin/perl
use 5.018;
use PDF::API2;
use strict;
use warnings;

my $path = "./pdfs/";
my $out_pdf_file = 'merged.pdf';
my $out_pdf;

opendir (my $DIR, $path) or die "Could not open $path:\n$!\n$^E";
chdir $path;

while ( my $in_pdf_file = readdir $DIR ) {
    next if $in_pdf_file =~ /^\./;

    my $in_pdf = PDF::API2->open($in_pdf_file)
        or die "Error opening PDF file [$in_pdf_file]:\n$!\n$^E";

    # Append to output if PDF already exists
    if ( -e $out_pdf_file ) {
        $out_pdf = PDF::API2->open($out_pdf_file);
    }
    # Create new PDF if output does not exist
    else {
        $out_pdf = PDF::API2->new(-file => $out_pdf_file);
    }

    foreach my $page ( 1 .. $in_pdf->pages() ) {
        $out_pdf->import_page($in_pdf, $page, 0);
    }

    $out_pdf->update();
}
closedir $DIR;

I cannot help but feel I am going about this merging process in a horribly inefficient way (Schlemiel the painter anyone?). Does anyone have any pointers toward a correct/more efficient merging technique?

Thank you for your time.

perl -v
This is perl 5, version 18, subversion 2 (v5.18.2) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail)
Copyright 1987-2013, Larry Wall
Binary build 1802 [298023] provided by ActiveState http://www.ActiveState.com
Built Apr 15 2014 10:38:37

perl -MPDF::API2 -E "say $PDF::API2::VERSION;"
2.023

Script used to generate the sample PDF files:

Replies are listed 'Best First'.
Re: Problem merging thousands of PDFs with PDF::API2: 'Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release"'
by zwon (Abbot) on Oct 13, 2014 at 16:33 UTC
    You're re-opening $out_pdf for every new input file; that may be one of the reasons. Try opening the output PDF file just once, before the while loop.
      You're re-opening $out_pdf for every new input file; that may be one of the reasons. Try opening the output PDF file just once, before the while loop.

      That certainly is a problem. However, according to the PDF::API2 docs for the update method, $out_pdf is removed from memory as soon as it has been written out to the merged PDF, i.e. at the end of the first iteration of the loop.

      $pdf->update()
          Saves a previously opened document.
      <...>
      $pdf->end()
          Remove the object structure from memory. PDF::API2 contains circular references, so this call is necessary in long-running processes to keep from running out of memory.
          This will be called automatically when you save or stringify a PDF.

      I get the following error when I modify the script to open $out_pdf before the loop:

      Can't call method "new_obj" on an undefined value at C:/Perl64/site/lib/PDF/API2/Basic/PDF/Pages.pm line 92.

      The new, updated script follows:

      #!/usr/bin/perl
      use 5.018;
      use PDF::API2;
      use strict;
      use warnings;

      my $path = "./pdfs/";
      my $out_pdf_file = 'merged.pdf';
      my $out_pdf = PDF::API2->new(-file => $out_pdf_file);

      opendir (my $DIR, $path) or die "Could not open $path:\n$!\n$^E";
      chdir $path;

      while ( my $in_pdf_file = readdir $DIR ) {
          next if $in_pdf_file =~ /^\./;

          my $in_pdf = PDF::API2->open($in_pdf_file)
              or die "Error opening PDF file [$in_pdf_file]:\n$!\n$^E";

          foreach my $page ( 1 .. $in_pdf->pages() ) {
              $out_pdf->import_page($in_pdf, $page, 0);
          }

          $out_pdf->update();
      }
      closedir $DIR;

      Moving $out_pdf->update(); out of the loop fixes the undefined value error, but the script quickly exhausts all the memory on the computer.

        If you don't need any PDF::API2-specific features, then perhaps you can try something else, for example CAM::PDF? It comes with an appendpdf.pl script which does something similar to what you need.
Re: Problem merging thousands of PDFs with PDF::API2: 'Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release"'
by MidLifeXis (Monsignor) on Oct 13, 2014 at 18:10 UTC

    IIRC (but this was a while ago, so my rememberer may be incorrect or out of date), PDF::API2 wraps the old + new documents into a new PDF::API2 document container. Therefore, if you are repeatedly building a new document this way, you end up with a structure that looks like...

                               p1-pN
                              /     \
                      p1-p(N-1)      pN
                     /        \
             p1-p(N-2)         p(N-1)
            /        \
    p1-p(N-3)         p(N-2)
    

    ... and so on. Once this reaches a few hundred pages, you have a very imbalanced tree, which can be inefficient to process. Also, since (I would guess - I have not recently checked the source) the PDF traversal code probably uses recursion, that could generate your deep recursion message.

    You could manually build a plan for a more balanced tree, and then build the final PDF file from that plan. Essentially you want to end up with the shortest binary tree that you can get for the number of original documents you have. For example, if you have 4 documents, you would merge 1+2 => A and 3+4 => B, and then merge A+B => C. For 8, you would do 1+2 => A, 3+4 => B, 5+6 => C, 7+8 => D; then A+B => E, and C+D => F; then E+F => G.
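The pairwise plan described above can be sketched in plain Perl. This is only an illustration of the planning step and does not touch PDF::API2 at all; the sub name build_merge_plan and the intermediate names tmp_1, tmp_2, ... are hypothetical placeholders, not anything the module provides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a balanced, pairwise merge plan for a list of documents.
# Each step is [left, right, result]. The "tmp_N" result names are
# hypothetical placeholders for intermediate merged files.
sub build_merge_plan {
    my @queue = @_;
    my @plan;
    my $n = 0;

    while ( @queue > 1 ) {
        my @next;

        # Pair up neighbours so the original page order is preserved.
        while ( @queue >= 2 ) {
            my ( $left, $right ) = splice @queue, 0, 2;
            my $result = 'tmp_' . ++$n;
            push @plan, [ $left, $right, $result ];
            push @next, $result;
        }

        # With an odd count, the last item is carried into the next round.
        push @next, @queue;
        @queue = @next;
    }

    return @plan;
}

my @plan = build_merge_plan( map { "doc$_.pdf" } 1 .. 8 );
printf "%s + %s => %s\n", @$_ for @plan;
```

For 8 inputs this yields 7 merge steps in 3 rounds, matching the 1+2 => A ... E+F => G example; each step would then be carried out by whatever merge routine you already have.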

    If this is the case (see the first paragraph, and look at the resulting PDF file structure after merging a couple of documents), then a 'correct' (but possibly destructive) fix would be to rebalance the pages as new ones are inserted.

    As always, corrections welcome.

    Update: Cleaned up graphic

    --MidLifeXis

      a 'correct' (but possibly destructive) fix would be to rebalance the pages as new ones are inserted.

      Just for clarification, when you say "rebalance the pages", are you referring to a process wherein the script recursively processes and merges the PDF files in "batches" of, say, 16 files apiece?

      e.g.:

      LEVEL 1 (4_096 files): [pdf_level1] [pdf_level1] [pdf_level1] ... [pdf_level1]
      LEVEL 2 (  256 files): [[pdf_level1] * 16] [[pdf_level1] * 16] [[pdf_level1] * 16] ... [[pdf_level1] * 16]
      LEVEL 3 (   16 files): [[pdf_level2] * 16] [[pdf_level2] * 16] [[pdf_level2] * 16] ... [[pdf_level2] * 16]
      LEVEL 4 (    1 file):  [[pdf_level3] * 16]
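As a rough check of the level counts in the diagram above, here is a minimal sketch that computes how many files remain after each round of batching. The sub name batch_levels is hypothetical, and the batch size of 16 is simply the figure from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(ceil);

# Return the number of files present at each level when every round
# merges groups of up to $batch files into one.
sub batch_levels {
    my ( $files, $batch ) = @_;
    die "batch size must be at least 2\n" if $batch < 2;

    my @levels = ($files);
    while ( $levels[-1] > 1 ) {
        push @levels, ceil( $levels[-1] / $batch );
    }
    return @levels;
}

print join( ' -> ', batch_levels( 4096, 16 ) ), "\n";    # 4096 -> 256 -> 16 -> 1
```

With 16 files per batch, 4_096 inputs need three rounds of merging, so no merged file is ever built from more than 16 parts at a time.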

        Almost, but not quite. More of a Balanced Tree algorithm.

        --MidLifeXis

      Thanks. That helped my issues.
Re: Problem merging thousands of PDFs with PDF::API2: 'Deep recursion on subroutine "PDF::API2::Basic::PDF::Objind::release"'
by Laurent_R (Canon) on Oct 13, 2014 at 17:49 UTC
    The "Deep recursion on subroutine ..." message is a warning, not an error. It may be an important symptom that something is going wrong, but in itself it will not stop your program from running. Per perldiag, this warning is displayed once a subroutine has called itself (directly or indirectly) 100 times more than it has returned. Consider this one-liner implementation of the factorial function:
    $ perl -wE 'say fact(102); sub fact {my $c = shift; return 1 if $c == 1; return $c *fact($c-1)}'
    Deep recursion on subroutine "main::fact" at -e line 1.
    9.61446671503512e+161
    We have the warning, but the program continued and printed the result.

    But you can silence that warning if you know well enough what you are doing, using the no warnings "recursion"; pragma. For example:

    $ perl -wE 'say fact(102); sub fact {my $c = shift; return 1 if $c == 1; no warnings "recursion"; return $c *fact($c-1)}'
    9.61446671503512e+161
    If your program stops working, it may be because it recurses far too deeply, but in that case you should presumably see another message, an actual error (rather than a warning), telling you why it stopped (out of memory or something similar).
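The same point can be shown in script form. The sketch below is only an illustration: the sub names sum_to and quiet_sum_to are made up for the example, and the warnings are collected through a $SIG{__WARN__} handler purely so they can be counted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Recursive sum of 1..$n, deep enough to trigger the warning.
sub sum_to {
    my ($n) = @_;
    return $n if $n <= 1;
    return $n + sum_to( $n - 1 );
}

# Same logic, with the "recursion" warning category disabled
# for the rest of this block only.
sub quiet_sum_to {
    my ($n) = @_;
    return $n if $n <= 1;
    no warnings 'recursion';
    return $n + quiet_sum_to( $n - 1 );
}

# Collect warnings instead of printing them, so they can be counted.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $loud          = sum_to(150);
my $loud_warnings = scalar @warnings;

@warnings = ();
my $quiet          = quiet_sum_to(150);
my $quiet_warnings = scalar @warnings;

print "sum_to:       $loud ($loud_warnings warning(s))\n";
print "quiet_sum_to: $quiet ($quiet_warnings warning(s))\n";
```

Both subs return the same result; only the first one emits the warning. Because the pragma is lexically scoped, the rest of the program keeps its warnings.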