http://www.perlmonks.org?node_id=485615


in reply to Determining a PDF's orientation

Here is a script that will tell you what you want. It opens a PDF and extracts each page object's MediaBox information in the order it is stored in the file. The actual page order is determined by the order of the Kids entries of the Pages objects, and is left as an exercise for you.
use strict;

open my $fh, '<', 'out.pdf' or die "Cannot open out.pdf: $!";
binmode( $fh );

my %page;
my @kids;
my @pages;
my ( $line, $buf ) = ( '', '' );

while( read($fh, $buf, 256) ){
    $line .= $buf;
    # Split into objects. The -1 limit keeps a trailing empty field, so the
    # last element is always the incomplete tail; hold it back for the next
    # read instead of scanning a partial object.
    my (@lines) = split /endobj/, $line, -1;
    $line = pop @lines;
    foreach ( @lines ){
        my $obj;
        my $pf = 0;          # set when the object mentions Pages
        my $cf = 0;          # set when the object is the Catalog
        my $parent = "";
        my $rotation = 0;    # stays 0 when there is no /Rotate entry
        s/\r|\n/ /g;
        if ( /(\d+\s+\d+\s+obj)/ ){ $obj = $1; }
        if ( /Rotate\s+(\d+)/ ){ $rotation = $1; }
        if ( /Pages\s+/){ $pf = 1; }
        if ( /Parent\s+(\d+\s+\d+\s+R)/ ){ $parent = $1; }
        if ( /Catalog/ ){ $cf = 1; }
        if ( /Pages\s+(\d+\s+\d+\s+R)/ && $cf == 1 ){
            push @kids, "1st Pages obj: $1";
        }
        if ( /Kids\s+\[([^\]]+)\]/ && $pf == 1 ){
            push @kids, "$obj", $1;
        }
        if ( /MediaBox\s+\[([^\]]+)\]/ ){
            $page{$obj} = "$1 R:$rotation";
            push @pages, $obj;
        }
    }
}

print "$_\n" for @kids;
print "$_ $page{$_}\n" for @pages;

__END__
Output:
1st Pages obj: 4 0 R       <- root pages object
4 0 obj
5 0 R 8 0 R                <- Kids of root pages (page 1 is obj 5 0 R )
5 0 obj 0 0 612 792 R:0    <- ( 0 0 612 792 ) MediaBox
8 0 obj 0 0 612 792 R:0
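The page-order exercise can be sketched as a depth-first walk over the Kids arrays. This is only a sketch under assumptions: %kids_of and page_order are hypothetical names, and the sample data is hand-built to mirror the output above rather than read from a real file. It assumes each Pages object's Kids string has already been collected, keyed by that object's number and generation.

```perl
use strict;
use warnings;

# Hypothetical input: Kids strings keyed by "N G" object id
# (hand-built sample data mirroring the output above).
my %kids_of = (
    '4 0' => '5 0 R 8 0 R',
);

# Depth-first walk: an id with a Kids entry is a Pages node, anything
# else is treated as a leaf page. Returns page ids in display order.
sub page_order {
    my ($id, $kids_of, $seen) = @_;
    $seen ||= {};
    return if $seen->{$id}++;                  # guard against reference loops
    return $id unless exists $kids_of->{$id};  # leaf page
    my @refs;
    push @refs, "$1 $2" while $kids_of->{$id} =~ /(\d+)\s+(\d+)\s+R/g;
    return map { page_order($_, $kids_of, $seen) } @refs;
}

my @order = page_order('4 0', \%kids_of);
print "page $_ is object $order[$_ - 1]\n" for 1 .. scalar @order;
```

The root id ('4 0' here) would come from the Catalog's /Pages reference, which the script reports as the "1st Pages obj".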

The R:0 is the page view rotation (a multiple of 90). Page dimensions are defined in points (72 per inch), so a portrait page would have an entry such as /MediaBox [0 0 612 792]. I hope this is not overkill for you. I know a bit too much about this because I have written a PDF module which I have not yet released to CPAN...
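To turn one of those output lines into an answer, the MediaBox and rotation can be combined roughly as follows. This is a minimal sketch, not part of the script: orientation is a hypothetical helper, and it assumes the MediaBox string has the four-number form shown in the output and that the rotation is a multiple of 90.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a page from its MediaBox string
# ("llx lly urx ury", in points) and its /Rotate value in degrees.
sub orientation {
    my ($mediabox, $rotate) = @_;
    my ($llx, $lly, $urx, $ury) = split ' ', $mediabox;
    my ($w, $h) = (abs($urx - $llx), abs($ury - $lly));
    # A 90 or 270 degree view rotation swaps width and height.
    ($w, $h) = ($h, $w) if ($rotate % 180) == 90;
    return $w > $h ? 'landscape' : $w < $h ? 'portrait' : 'square';
}

print orientation('0 0 612 792', 0),  "\n";
print orientation('0 0 612 792', 90), "\n";
print orientation('0 0 792 612', 0),  "\n";
```

This only covers MediaBox and /Rotate; it cannot see a rotation that is applied inside the page contents.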
Hope it helps,
JamesNC

Re^2: Determining a PDF's orientation
by friedo (Prior) on Aug 22, 2005 at 07:13 UTC
    Thanks, JamesNC++! I really appreciate you taking the time to post that. You clearly have a lot of experience with PDFs. My condolences.

    Unfortunately your script is reporting R:0 for every PDF I try it on, even though some are clearly rotated. Perhaps I'm missing something. FWIW, these particular PDFs all have exactly one page (they came from various pdftk burst operations.)

    Is there a particular reason why you are only buffering 256 bytes? I increased the buffer size to 65536 as it was taking several minutes to run on some of my PDFs.

      That is a pretty big buffer size, and I think 1024 would do just as well because the data just gets chopped up anyway. R:0 means that the optional /Rotate entry is missing. If you know the page is in landscape mode, then the page's MediaBox should reflect that, i.e. 0 0 792 612 would be 792 wide by 612 high ( 11 x 8 1/2 inches ). Now, if the text is rotated and it is not due to the /Rotate option in the viewer, then the reason is that the user space has been transformed. In PDF speak:
      use constant PI => 4 * atan2( 1, 1 );

      my $angle  = 90;
      my $cos    = sprintf( "%.4f", cos( $angle * PI / 180 ) );
      my $sin    = sprintf( "%.4f", sin( $angle * PI / 180 ) );
      my $sin2   = -$sin;
      my $rotate = "$cos $sin $sin2 $cos 0 0 cm ";
      The $rotate string is applied to the contents of the page, and then everything in that space still operates as if x and y are unchanged... but the page is simply rotated.
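Going the other way, the angle can be recovered from a cm operator once the page contents are available as plain text. The following is a rough sketch under assumptions: cm_angle is a hypothetical helper, it looks only at the first cm operator it finds, and it does not compose nested transformations.

```perl
use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# Hypothetical helper: find the first "a b c d e f cm" operator in an
# uncompressed content stream and return its rotation in whole degrees
# (0..359), or nothing if no cm operator is present.
sub cm_angle {
    my ($content) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;
    return unless $content
        =~ /($num)\s+($num)\s+$num\s+$num\s+$num\s+$num\s+cm\b/;
    my ($m_a, $m_b) = ($1, $2);          # a = cos, b = sin for a pure rotation
    my $deg = atan2($m_b, $m_a) * 180 / PI;
    return sprintf('%.0f', $deg) % 360;
}

my $angle = cm_angle('q 0.0000 1.0000 -1 0.0000 0 0 cm Q');
print "rotated by $angle degrees\n" if defined $angle;
```

Because atan2 ignores uniform scaling, a matrix that both scales and rotates still yields the rotation angle.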
      The page contents are often LZW or Flate (deflate) encoded, and you would have to decompress them to see whether that is the case... kind of messy to do by hand. I know how to do it... but that may not be the best solution for you. I would just try to infer the orientation from the page size and go from there. That is part of the challenge of trying to decode a PDF... if the user space is translated or transformed, then the consumer application needs to be aware of that. Perhaps you should send the author of that PDF module a message to alert them to your problem; they might have overlooked something.
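For the Flate case, the decompression step itself is short with the Compress::Zlib module. This is a sketch only: the content stream below is made-up sample data that gets compressed in place of bytes cut from a real file, and it assumes a plain /FlateDecode filter with no predictor. LZW streams and extracting the bytes between stream and endstream are not covered.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Stand-in for a real page: a made-up content stream with a 90 degree
# rotation, compressed the way a /FlateDecode filter would store it.
my $plain  = "q 0 1 -1 0 612 0 cm BT /F1 12 Tf (Hello) Tj ET Q\n";
my $stored = compress($plain);

# uncompress() expects zlib-wrapped data, which is what FlateDecode uses.
my $content = uncompress($stored);
die "could not inflate stream\n" unless defined $content;

print "found a cm operator\n" if $content =~ /\bcm\b/;
```

With a real file, $stored would be the raw stream bytes of the page's Contents object, read with binmode as in the original script.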
      Wish I could be more helpful.
      JamesNC