PerlMonks  

Determining a PDF's orientation

by friedo (Prior)
on Aug 21, 2005 at 00:33 UTC (#485453=perlquestion)

friedo has asked for the wisdom of the Perl Monks concerning the following question:

I am using PDF::Reuse to munge some PDFs. I have the dimensions of the PDFs in question stored in a database, so when I create a new one, I can use the rMbox function to set the right size. Unfortunately, some of my source PDFs appear to be in landscape mode, so the X and Y coordinates are essentially reversed. However, the coordinates, which come from pdfinfo, are always in X,Y order regardless of orientation. This results in the munged PDFs being cut off as the content ends up outside the page boundaries.

I also tried using Image::Magick to get the size of each PDF, but the dimensions are reported in the same order as pdfinfo so that's no help. If anyone knows an easy and reliable way to determine the orientation of a PDF (or to simply get the dimensions in the right order in the first place) I would be very grateful.

Thanks.

Replies are listed 'Best First'.
Re: Determining a PDF's orientation
by Roger (Parson) on Aug 21, 2005 at 00:41 UTC
    Have you tried the PDF suite to determine the page orientation?
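    use PDF;    # front-end module of the PDF distribution
    use Data::Dumper;

    my $pdf = PDF->new;    # the object must be created before calling methods on it
    $pdf->TargetFile($filename);
    $pdf->LoadPageInfo;
    my @size     = $pdf->PageSize;
    my $rotation = $pdf->PageRotation;
    print Dumper(\@size);
    print Dumper($rotation);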
      Unfortunately, whenever I attempt to use PDF, I get the error "Bad object reference '>'", (an error which I have never seen before, anywhere.) As with everything else involving PDFs, there aren't any useful docs for solving this problem. It seems to me the entire purpose of this hell-spawned document format is for pissing off programmers. Oh well. I appreciate the help, though.
Re: Determining a PDF's orientation
by JamesNC (Chaplain) on Aug 22, 2005 at 05:45 UTC
    Here is a script which will tell you what you want. The script opens a PDF and then extracts each page object's MediaBox information in the order it is stored in the file. The actual page order is determined by the order of the Kids entries for the Pages objects and is left as an exercise for you.
    use strict;

    my $fh;
    open $fh, "<out.pdf" or die "$!";
    binmode( $fh );

    my %page;     # object id => "MediaBox R:rotation"
    my @kids;     # Pages objects and their Kids lists
    my @pages;    # ids of objects that carry a MediaBox, in file order
    my ( $line, $buf ) = ( "", "" );

    while ( read( $fh, $buf, 256 ) ) {
        $line .= $buf;
        # Split on endobj; the -1 limit keeps a trailing empty field, so the
        # last element is always the incomplete tail of the buffer.
        my @lines = split /endobj/, $line, -1;
        # Carry the tail into the next read instead of scanning it twice.
        $line = pop @lines;
        foreach ( @lines ) {
            my $obj      = "";
            my $pf       = 0;    # set when the object mentions Pages
            my $cf       = 0;    # set when the object is the Catalog
            my $parent   = "";
            my $rotation = 0;    # stays 0 when /Rotate is absent
            s/\r|\n/ /g;
            if ( /(\d+\s+\d+\s+obj)/ )         { $obj      = $1; }
            if ( /Rotate\s+(\d+)/ )            { $rotation = $1; }
            if ( /Pages\s+/ )                  { $pf       = 1; }
            if ( /Parent\s+(\d+\s+\d+\s+R)/ )  { $parent   = $1; }
            if ( /Catalog/ )                   { $cf       = 1; }
            if ( /Pages\s+(\d+\s+\d+\s+R)/ && $cf == 1 ) {
                push @kids, "1st Pages obj: $1";
            }
            if ( /Kids\s+\[([^\]]+)\]/ && $pf == 1 ) {
                push @kids, "$obj", $1;
            }
            if ( /MediaBox\s+\[([^\]]+)\]/ ) {
                $page{$obj} = "$1 R:$rotation";
                push @pages, $obj;
            }
        }
    }
    print "$_\n" for @kids;
    print "$_ $page{$_}\n" for @pages;
    __END__
    Output:
    1st Pages obj: 4 0 R       <- root pages object
    4 0 obj
    5 0 R 8 0 R                <- Kids of root pages (page 1 is obj 5 0 R)
    5 0 obj 0 0 612 792 R:0    <- ( 0 0 612 792 ) MediaBox
    8 0 obj 0 0 612 792 R:0

    The R:0 is the page view rotation (a multiple of 90). The page's dimensions are defined in points, so a portrait definition would have the following entry: /MediaBox [0 0 612 792]. I hope this is not overkill for you. I know a bit too much about this because I have written a PDF module which I have not yet released to CPAN...
    Hope it helps,
    JamesNC
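A minimal sketch of how the MediaBox and the /Rotate value reported by the script above could be combined into the size a viewer displays. The helper names `displayed_size` and `orientation` are hypothetical, and the sketch assumes the MediaBox arrives as the space-separated string the script prints (for example "0 0 612 792") and that /Rotate is a multiple of 90.

```perl
use strict;
use warnings;

# Hypothetical helper: given a MediaBox string such as "0 0 612 792" and a
# /Rotate value, return the width and height as a viewer would display them.
sub displayed_size {
    my ($mediabox, $rotate) = @_;
    my ($llx, $lly, $urx, $ury) = split ' ', $mediabox;
    my $width  = abs($urx - $llx);
    my $height = abs($ury - $lly);
    $rotate = ($rotate || 0) % 360;
    # A quarter turn (90 or 270) swaps the displayed width and height.
    ($width, $height) = ($height, $width) if $rotate == 90 || $rotate == 270;
    return ($width, $height);
}

# Hypothetical helper: classify the displayed page. A square page is
# reported as portrait, which is an arbitrary choice made for this sketch.
sub orientation {
    my ($width, $height) = displayed_size(@_);
    return $width > $height ? 'landscape' : 'portrait';
}

print join(' x ', displayed_size('0 0 612 792', 0)),  "\n";
print join(' x ', displayed_size('0 0 612 792', 90)), "\n";
print orientation('0 0 792 612', 0), "\n";
```

The displayed width and height are what you would want to pass along when sizing the new page. This only covers rotation expressed through /Rotate, not rotation baked into the page contents.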
      Thanks, JamesNC++! I really appreciate you taking the time to post that. You clearly have a lot of experience with PDFs. My condolences.

      Unfortunately your script is reporting R:0 for every PDF I try it on, even though some are clearly rotated. Perhaps I'm missing something. FWIW, these particular PDFs all have exactly one page (they came from various pdftk burst operations.)

      Is there a particular reason why you are only buffering 256 bytes? I increased the buffer size to 65536 as it was taking several minutes to run on some of my PDFs.

        That is a pretty big buffer size and I think 1024 would do just as well because it just gets chopped up anyway. R:0 means that the optional /Rotate entry is missing. If you know the page is in landscape mode, then the page MediaBox should match the page, i.e. 0 0 792 612 would be 792 wide by 612 high (11 x 8 1/2 inches). Now, if the text is rotated and it is not due to the /Rotate option in the viewer, then the reason is that the user space has been transformed. In PDF speak:
        use constant PI => 4 * atan2( 1, 1 );    # PI was not defined in the snippet

        my $angle  = 90;
        my $cos    = sprintf( "%.4f", cos( $angle * PI / 180 ) );
        my $sin    = sprintf( "%.4f", sin( $angle * PI / 180 ) );
        my $sin2   = -$sin;
        my $rotate = "$cos $sin $sin2 $cos 0 0 cm ";
        The $rotate string is applied to the contents of the page, and everything drawn in that space still operates as if x and y were unchanged; the page contents are simply rotated.
        The page contents are often LZW or deflate encoded, and you would have to decompress them to see whether that is the case, which is messy to do by hand. I know how to do it, but that may not be the best solution for you. I would just try to infer the orientation from the page size and go from there. That is part of the challenge of trying to decode a PDF: if the user space is translated or transformed, then the consumer application needs to be aware of that. Perhaps you should send the author of that PDF module a message to alert them to your problem; they might have overlooked something.
        Wish I could be more helpful.
        JamesNC
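As a rough illustration of the "cm" transformation described above, the sketch below recovers the rotation angle from the first `cm` operator in a content stream that has already been decompressed. The helper name `cm_rotation` is hypothetical, and the approach is an assumption rather than anything from the thread: it looks only at the first matrix, ignores nested or combined transformations, and could be fooled by text that happens to look like an operator.

```perl
use strict;
use warnings;

use constant PI => 4 * atan2(1, 1);

# Hypothetical helper: find the first "a b c d e f cm" operator in an
# already-decompressed content stream and return the rotation it implies,
# in whole degrees (0-359). Returns nothing if no cm operator is found.
sub cm_rotation {
    my ($content) = @_;
    my $num = qr/[-+]?(?:\d+\.?\d*|\.\d+)/;
    return
        unless $content =~ /($num)\s+($num)\s+$num\s+$num\s+$num\s+$num\s+cm\b/;
    # For a pure rotation the matrix is [cos sin -sin cos 0 0],
    # so the first two operands give the angle directly.
    my ($cos, $sin) = ($1, $2);
    my $degrees = atan2($sin, $cos) * 180 / PI;
    $degrees += 360 if $degrees < 0;
    return sprintf('%.0f', $degrees) % 360;
}

my $stream = "q 0.0000 1.0000 -1 0.0000 0 0 cm BT /F1 12 Tf (Hi) Tj ET Q";
my $angle  = cm_rotation($stream);
print defined $angle ? "rotated by $angle degrees\n" : "no cm operator found\n";
```

A result of 90 or 270 would suggest that the visible content is turned relative to the MediaBox even though /Rotate is absent.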
Re: Determining a PDF's orientation
by calin (Deacon) on Aug 21, 2005 at 18:14 UTC

    I know this is an unsophisticated solution, but you can take advantage of the fact that usually in portrait documents the "x" dimension is smaller than "y" ; in landscape documents the reverse is true.

    So by comparing "x" with "y" you can get a rough estimate of whether the document is in portrait or landscape mode.

    Hope I understand your problem correctly.

      That is one thing I have considered, though it is unreliable, unfortunately. The PDFs in question come in all sorts of bizarre sizes and some are bound to be wider than they are tall, but still in "portrait" mode according to the PDF. Thanks for the suggestion, though.
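For reference, the comparison suggested above can be sketched as follows. The helper name `guess_orientation` is hypothetical, and the "Page size:" and "Page rot:" line formats are assumed from typical pdfinfo output rather than taken from this thread. As noted in the reply, the result is only a guess for unusually sized pages.

```perl
use strict;
use warnings;

# Hypothetical helper: read the page size (and rotation, if present) from
# pdfinfo-style text and guess the orientation by comparing x with y.
# The "Page size:" and "Page rot:" formats are assumptions of this sketch.
sub guess_orientation {
    my ($info) = @_;
    my ($x, $y) = $info =~ /^Page size:\s+([\d.]+)\s+x\s+([\d.]+)/m
        or return;
    my ($rot) = $info =~ /^Page rot:\s+(\d+)/m;
    $rot = 0 unless defined $rot;
    # A quarter turn swaps the displayed x and y.
    ($x, $y) = ($y, $x) if $rot % 180 == 90;
    return $x > $y ? 'landscape' : 'portrait';
}

my $sample = <<'END';
Pages:          1
Page size:      612 x 792 pts (letter)
Page rot:       90
END

print guess_orientation($sample), "\n";
```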
