PerlMonks  

Macintosh PDF's on Windows

by HamNRye (Monk)
on Mar 01, 2007 at 17:51 UTC (#602732=perlquestion)
HamNRye has asked for the wisdom of the Perl Monks concerning the following question:

Arrgh...

I have a program that crops PDF pages. Basically, we scan the file for the MediaBox[] area, and can enter our crop values as ASCII text. It all works great.

However, we came across a PDF we can't handle. The reason: it's from a Mac. The line endings appear to be mostly "0D" as opposed to "0A", so when I open the file and resave it, even in binmode, the file is mangled.

I'll try to explain this with as little BS as possible...

A sample of the two files:

Mac:
%PDF-1.3 %‚
======================
25 50 44 46 2D 31 2E 33 0D 25 E2

PC:
%PDF-1.3 %‚
=======================
25 50 44 46 2D 31 2E 33 0A 25 E2

I have tried the usual "convert mac to dos" line-ending conversion and it screws up the xref table. Does anyone have any experience with Mac PDF's and manipulating them on Windows?

perl -p -e 's/(\r\n|\n|\r)/\r/g'   inputfile > outputfile

Re: Macintosh PDF's on Windows
by traveler (Parson) on Mar 01, 2007 at 18:07 UTC
    I do not know what modules (if any) you are using to process the pdf. If you do the file opening, you might try:
    use PerlIO::eol;
    #...
    open (FILE, "mypdf.pdf") or die("Can't open mypdf.pdf");
    binmode FILE, ":raw:eol(LF)";
Re: Macintosh PDF's on Windows
by ikkon (Monk) on Mar 01, 2007 at 18:12 UTC
    I create PDF's using PDF::API2 and it works fine on Mac and PC (I create it on a Mac), but if worst comes to worst you could search and replace 0D with 0A.
Re: Macintosh PDF's on Windows
by Moron (Curate) on Mar 01, 2007 at 18:19 UTC
    (oops lol - just realised that the "perl -p -e ... " at the end of the OP is an effort to solve - first impression was it was just a monk signature)

    ... try to match as closely as you can to the substring you are trying to change - use /...$/ to mark the end of the line in your regex - match line by line and if there are variations, if necessary, be prepared to use multiple regexes for them.

    See also perlre

    -M

    Free your mind

Re: Macintosh PDF's on Windows
by almut (Canon) on Mar 01, 2007 at 18:28 UTC

    Yes, you sure don't want to change the 0D into anything else... Apart from the xref table, there are typically also zipped (i.e. binary) streams, which also could contain 0D bytes.

    As long as you open input and output filehandles without automatic line ending translation, you should be fine, though. And, if you need to process the file in a line-based fashion, just set the input record separator $/ to "\r" (=0D):

    $/ = "\r";
    open my $in_fh, "<:raw", "Mac_in.pdf" or die "$!";
    open my $out_fh, ">:raw", "Mac_out.pdf" or die "$!";
    while (<$in_fh>) {
        # do something with the line...
        print $out_fh $_;
    }

    (also make sure to never change the size (number of bytes) of any objects in the PDF, or else the offsets given in the xref table will become incorrect... but you probably knew that already)
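A minimal sketch of the "never change the size" advice, assuming the MediaBox entry sits in an uncompressed page object and that the new values need no more bytes than the old ones. The helper name patch_mediabox and the sample object string are made up for illustration; they are not from the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: overwrite the numbers inside each /MediaBox[...] entry,
# padding with spaces so the total byte count (and so every xref offset)
# stays exactly the same.
sub patch_mediabox {
    my ($pdf, @box) = @_;
    my $new = join ' ', @box;
    $pdf =~ s{(/MediaBox\s*\[)([^\]]*)(\])}{
        my ($open, $old, $close) = ($1, $2, $3);
        die "new box does not fit in " . length($old) . " bytes\n"
            if length($new) > length($old);
        $open . $new . (' ' x (length($old) - length($new))) . $close;
    }ge;
    return $pdf;
}

# Invented sample with Mac (CR) line endings.
my $in  = "1 0 obj\r<< /Type /Page /MediaBox [0 0 612.000 792.000] >>\rendobj\r";
my $out = patch_mediabox($in, 36, 36, 576, 756);

print length($in) == length($out) ? "same length\n" : "length changed\n";
```

The padding spaces are ordinary PDF white space inside the array. If the new values are longer than the old ones, this approach does not apply and the xref table would have to be rebuilt instead.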

Re: Macintosh PDF's on Windows
by mr_mischief (Prior) on Mar 01, 2007 at 21:13 UTC
    For one, that's not a Mac to DOS line ending conversion. That's a DOS/Unix/Mac to Mac line ending conversion AFAICT. (You're also doing a capture that you're not using.) Mac to DOS would be s/\r/\r\n/ IIRC. However, most PDF files contain binary data, so that's probably not the best route.
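To illustrate why a Mac-to-DOS substitution upsets the xref table, here is a small sketch on an invented byte string (not a real PDF). Every CR that becomes CR LF pushes all later bytes one position further along, so any stored byte offset after it goes stale.

```perl
use strict;
use warnings;

# Invented fragment with Mac (CR) line endings.
my $mac = "%PDF-1.3\r1 0 obj\r<< >>\rendobj\r";

# The Mac-to-DOS substitution, applied to a copy.
(my $dos = $mac) =~ s/\r/\r\n/g;

printf "offset of 'endobj' before: %d\n", index($mac, "endobj");
printf "offset of 'endobj' after:  %d\n", index($dos, "endobj");
printf "file grew by %d bytes\n", length($dos) - length($mac);
```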

    You probably should preserve the byte integrity of the file, especially if the file contains any binary data. From the PDF 1.6 spec, page 25 about Lexical Conventions:
    A PDF file containing binary data must be transported and stored by means that preserve all bytes of the file faithfully; that is, as a binary file rather than a text file. Such a file is not portable to environments that impose reserved character codes, maximum line lengths, end-of-line conventions, or other restrictions.

    A carriage return, a line feed, or both together is an acceptable line ending according to the spec. Your software or the libraries you use would be wise to stick to the spec. From the 1.6 spec page 26:
    The carriage return (CR) and line feed (LF) characters, also called newline characters, are treated as end-of-line (EOL) markers. The combination of a carriage return followed immediately by a line feed is treated as one EOL marker. For the most part, EOL markers are treated the same as any other white-space characters. However, sometimes an EOL marker is required or recommended; that is, the following token must appear at the beginning of a line.

    The secret to your success, it seems, is in not trusting your friendly neighborhood OS to handle EOL for you. Open both source and destination in binmode, use read() or sysread(), and determine line endings for yourself.
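One way this could look, as a rough sketch: read the start of the file in binmode and check which EOL marker turns up first. The helper names detect_eol and sniff_file are hypothetical, and the 1024-byte sniff size is an arbitrary choice.

```perl
use strict;
use warnings;

# Hypothetical helper: name the first EOL marker found in a byte string.
sub detect_eol {
    my ($bytes) = @_;
    return "CRLF" if $bytes =~ /\A[^\r\n]*\r\n/;
    return "CR"   if $bytes =~ /\A[^\r\n]*\r/;
    return "LF"   if $bytes =~ /\A[^\r\n]*\n/;
    return "none";
}

# Hypothetical helper: read the first bytes of a file with no
# line-ending translation and report its EOL convention.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;
    my $buf = '';
    my $got = read($fh, $buf, 1024);
    die "Can't read $path: $!" unless defined $got;
    close $fh;
    return detect_eol($buf);
}

print detect_eol("%PDF-1.3\r%\xE2"), "\n";   # Mac header bytes from the question
print detect_eol("%PDF-1.3\n%\xE2"), "\n";   # PC header bytes from the question
```

The detected marker can then be assigned to $/ before reading the file record by record, with the output handle also left in binmode.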

    The PDF specifications are available in PDF format from Adobe for free download. You can get from 1.3 to 1.7 specs here. The full spec is cumbersome, but PDF::API2 and PDF::API2::Simple among others have already been built if you don't want to mess with it yourself. I haven't played with moving PDFs around too much, but the ones I generate using PDF::API2 and PDF::API2::Simple on Linux work great on Windows, and those have differing text-file line endings.


    Christopher E. Stith
Re: Macintosh PDF's on Windows
by superfrink (Curate) on Mar 01, 2007 at 23:16 UTC
    I don't know if PDFs do this, but one time an OS X user gave me a QuickTime file that didn't work.

    It turns out that the codec was part of the mac file "resource" and stored in the filesystem rather than in the file. When the file was sent over FTP the codec didn't come along with it.

    It turned out that the "cp" command line program on OSX did not copy the resource data either. In order for the resource data to be copied the file had to be copied using the GUI.

    I copied the file using "cp" and compared it using "cmp" and "diff" in addition to "md5" (or "md5sum"). None of these commands showed any difference in the file, but the copy would not play on the same Mac that the original file played on.
