
Editing/Replacing Text in a PDF

by ikkon (Monk)
on Dec 20, 2006 at 14:24 UTC (#590910=perlquestion)
ikkon has asked for the wisdom of the Perl Monks concerning the following question:

OK, I will try to be as detailed as possible. I have to create a PDF from variables posted from Flash. Getting the variables is easy, and putting them in a PDF, which was challenging when I first started, is now for the most part under control. The tricky part is that the layout for the PDF is designed in Adobe InDesign. I use Perl (PDF::API2) to open the PDF and place the variables in the template, but the designer sets the text in InDesign and wants me to replace his placeholder text with the variables posted from Flash. Right now I have been trying to just replace text, and nothing works. This is the code I have so far:
#!/usr/bin/perl
use PDF::API2;
use GD::Graph::bars;
use GD::Graph;
use GD;
use strict;

use constant mm => 25.4/72;
use constant in => 1/72;
use constant pt => 1;

my $pdf = PDF::API2->open('InDesignTemp.pdf');

my @data = (
    ["Jan-01","Feb-01","Mar-01"],
    [21,25,33]
);
my $graph = new GD::Graph::bars;
$graph->set(
    x_label     => 'Month',
    y_label     => 'Revenue ($1,000s)',
    title       => 'Monthly Online Sales for 2001',
    bar_spacing => 10
) or warn $graph->error;
$graph->plot(\@data) or die $graph->error;

my %font = (
    Helvetica => {
        Bold   => $pdf->corefont('Helvetica-Bold',    -encoding => 'latin1'),
        Roman  => $pdf->corefont('Helvetica',         -encoding => 'latin1'),
        Italic => $pdf->corefont('Helvetica-Oblique', -encoding => 'latin1'),
    },
    Times => {
        Bold   => $pdf->corefont('Times-Bold',        -encoding => 'latin1'),
        Roman  => $pdf->corefont('Times',             -encoding => 'latin1'),
        Italic => $pdf->corefont('Times-Italic',      -encoding => 'latin1'),
    },
);

my $page = $pdf->openpage(1);

# attempted replacement; this acts on $_, not on the page, so it changes nothing
s/Title/Ben is the Man/;

my $headline_text = $page->text;
$headline_text->font( $font{'Helvetica'}{'Bold'}, 18/pt );
$headline_text->fillcolor( 'white' );
$headline_text->translate( 50/mm, 130/mm );
$headline_text->text_right( 'this is my new stuff' );

my $photo = $page->gfx;
open(GRAPH,">images/graph1.jpeg") || die "Cannot open graph1.jpeg: $!\n";
binmode GRAPH;
print GRAPH $graph->gd->jpeg(100);
close GRAPH;

my $photo_file = $pdf->image_jpeg("images/graph1.jpeg");
$photo->image($photo_file, 60/mm, 70/mm, 85/mm, 75/mm);

$pdf->saveas("Build3.pdf");

#########################################################################################
#### this part I was using to replace the text, however it corrupts the file after I do
#########################################################################################
#open(PDF,"<Build3.pdf") || print header('text/html').'can\'t open PDF file';
#binmode PDF;
#undef $/;
#$_ = <PDF>;
#close(PDF);
#s/Title/Ben is the Man/;
#open(OUT, '>out.pdf') || print header('text/html').'can\'t open out file';
#print OUT $_;
#close(OUT);

Re: Editing/Replacing Text in a PDF
by SFLEX (Chaplain) on Dec 20, 2006 at 14:44 UTC
    I believe this node can help you with creating and reading a pdf file.
    PDF via Perl?
      Thanks, but this doesn't have what I need. I can create PDFs just fine, and I find PDF::API2 better for what I am attempting. I need to replace text inside an existing PDF file, and this post doesn't help in that area, but thanks for replying.
Re: Editing/Replacing Text in a PDF
by toma (Vicar) on Dec 20, 2006 at 15:40 UTC
    The PDF contains various checksums that make it difficult to replace text this way, unless you use code that recomputes the checksums and puts them back into the PDF. It also positions the text based on its width, so you might have the best luck with center-justified text, for example.

    In general, it is also possible that the PDF contains compressed text.

    You also forgot binmode on the OUT file handle.
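A minimal sketch of the binmode point, using only core Perl: both the input and the output handle are switched to binary mode so the bytes pass through untouched. The helper names `slurp_binary` and `spew_binary` are made up for this illustration. Note that this only prevents newline translation; it does not make a substitution that changes the length of the text safe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file as raw bytes.
sub slurp_binary {
    my ($path) = @_;
    open(my $in, '<', $path) or die "Cannot open $path: $!";
    binmode $in;                 # no newline translation on read
    local $/;                    # slurp mode, restored when the sub returns
    my $bytes = <$in>;
    close $in;
    return $bytes;
}

# Write raw bytes back out.
sub spew_binary {
    my ($path, $bytes) = @_;
    open(my $out, '>', $path) or die "Cannot open $path: $!";
    binmode $out;                # the call missing from the original script
    print {$out} $bytes;
    close $out or die "Cannot close $path: $!";
}

# Round-trip some bytes that text mode could mangle on some platforms.
my $sample = "%PDF-1.4\r\n\x00\x1a\xff stream data\n";
my (undef, $tmp) = tempfile(UNLINK => 1);
spew_binary($tmp, $sample);
print "round trip ok\n" if slurp_binary($tmp) eq $sample;
```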

    The crude way to do what you want is to convert the PDF to Postscript, substitute the text, and convert it back. I use the ImageMagick program 'convert' to do this.

    `convert Build3.pdf Build3.ps`;
    # ... Postscript filtering code goes here ...
    `convert Build3.ps out.pdf`;
    There are also perl modules available to wrap the ImageMagick functionality. I believe that ImageMagick is in turn calling Ghostscript to handle the conversion, so you could use that directly, instead.
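If you go to Ghostscript directly, the round trip might look roughly like the sketch below. It assumes a `gs` executable on the PATH and a Ghostscript build that provides the `ps2write` and `pdfwrite` devices (older releases used `pswrite` instead); the helper names `gs_command` and `gs_convert` are made up for this example. Whether the placeholder survives as a plain searchable string in the PostScript depends on the converter, so check the intermediate file before relying on this.

```perl
use strict;
use warnings;

# Build the Ghostscript argument list for one conversion.
# $device is a Ghostscript output device such as 'ps2write' or 'pdfwrite'.
sub gs_command {
    my ($device, $in, $out) = @_;
    return ('gs', '-q', '-dNOPAUSE', '-dBATCH',
            "-sDEVICE=$device", "-sOutputFile=$out", $in);
}

# Run a conversion; list-form system() avoids shell quoting problems.
sub gs_convert {
    my @cmd = gs_command(@_);
    system(@cmd) == 0 or die "gs failed: @cmd (exit status $?)";
}

# Hypothetical usage, with the file names from this thread:
#   gs_convert('ps2write', 'Build3.pdf', 'Build3.ps');
#   ... Postscript filtering code goes here ...
#   gs_convert('pdfwrite', 'Build3.ps', 'out.pdf');

# Show the command line without running it.
print join(' ', gs_command('pdfwrite', 'Build3.ps', 'out.pdf')), "\n";
```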

    It should work perfectly the first time! - toma
      Thanks, this is very useful info. I will check it out and let you know what I figure out. Thanks again.
Re: Editing/Replacing Text in a PDF
by moklevat (Priest) on Dec 20, 2006 at 16:30 UTC
Re: Editing/Replacing Text in a PDF
by Russ (Deacon) on Dec 20, 2006 at 18:20 UTC
    PDFs are relatively simple beasts, but what you're trying to do is corrupting the xref tables, which hold byte offsets to the beginnings of PDF objects. I haven't seen the output of Indesign, but the fact that you are able to replace text directly suggests that Indesign does not compress all its objects. That's good for your purposes.

    You'll want the PDF Reference (published by Adobe) to understand this better, but look at the xref table in the output file. It looks like this:

    xref
    69 16
    0000000016 00000 n
    0000001041 00000 n
    0000000616 00000 n
    0000001121 00000 n
    0000001250 00000 n
    0000001381 00000 n
    0000001533 00000 n
    0000001567 00000 n
    0000001782 00000 n
    0000001858 00000 n
    0000002238 00000 n
    0000002609 00000 n
    0000002830 00000 n
    0000003269 00000 n
    0000003514 00000 n
    0000006183 00000 n
    You'll want to know which objects you are modifying so you can correct all the objects with higher offset values. An object starts with a section like this:
    71 0 obj
    This is the 71st object (note that the xref table I copied started its numbering at 69, for some reason), 0th revision. It starts after 616 bytes, which you can see in the xref table above.
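To make the lookup concrete, here is a small sketch that parses one xref subsection into an object-number-to-offset map. It assumes the classic plain-text table shown above (not a compressed cross-reference stream) and a single subsection; `parse_xref` is a name invented for this example.

```perl
use strict;
use warnings;

# Parse one xref subsection into a hash: object number => byte offset.
# Only in-use entries (flag 'n') are kept; free entries ('f') are skipped.
sub parse_xref {
    my ($text) = @_;
    my @lines = grep { /\S/ } split /\r?\n|\r/, $text;
    shift @lines if @lines && $lines[0] =~ /^\s*xref\s*$/;
    my $header = shift(@lines) // '';
    my ($first, $count) = $header =~ /^\s*(\d+)\s+(\d+)\s*$/
        or die "Missing subsection header";
    my %offset_of;
    for my $i (0 .. $count - 1) {
        my $line = $lines[$i] // '';
        my ($offset, $gen, $flag) = $line =~ /^\s*(\d{10})\s+(\d{5})\s+([nf])\s*$/
            or die "Bad xref entry: '$line'";
        $offset_of{ $first + $i } = $offset + 0 if $flag eq 'n';
    }
    return \%offset_of;
}

my $xref = <<'END';
xref
69 16
0000000016 00000 n
0000001041 00000 n
0000000616 00000 n
0000001121 00000 n
0000001250 00000 n
0000001381 00000 n
0000001533 00000 n
0000001567 00000 n
0000001782 00000 n
0000001858 00000 n
0000002238 00000 n
0000002609 00000 n
0000002830 00000 n
0000003269 00000 n
0000003514 00000 n
0000006183 00000 n
END

my $offsets = parse_xref($xref);
printf "object 71 starts at byte %d\n", $offsets->{71};   # third entry: 616
```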

    At a minimum, if all else goes well, you will have to correct the byte offsets of the objects that appear after your changes, so that the viewer can "find" them.

    If you're looking for an easy fix, this won't help. If you're willing to invest some time learning a really cool file format, jump on in!

    P.S. (Pun kinda intended) There may be more than one xref table. For your purposes, update all of them with offsets greater than your byte location until you know why you don't have to... :-)
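The offset correction can be sketched as follows. This assumes a plain-text xref table like the one above and that you already know the byte position of your edit and how many bytes it added or removed; the function name `shift_xref_offsets` and the edit position 700 are made up for the illustration. The sketch leaves out two things a real file also needs: the /Length of the stream that contains the edit, and the startxref value near the end of the file if the table itself moved.

```perl
use strict;
use warnings;

# Shift every in-use xref offset that lies beyond the edit position.
# $delta is (new length - old length) of the replaced text, in bytes.
sub shift_xref_offsets {
    my ($xref_text, $edit_pos, $delta) = @_;
    $xref_text =~ s{^(\d{10})( \d{5} n)}{
        my $off = $1 > $edit_pos ? $1 + $delta : $1;
        sprintf('%010d', $off) . $2
    }gme;
    return $xref_text;
}

# Replacing "Title" with "Ben is the Man" grows the file by 9 bytes.
my $delta = length('Ben is the Man') - length('Title');

my $xref = "xref\n69 3\n"
         . "0000000016 00000 n\n"
         . "0000001041 00000 n\n"
         . "0000000616 00000 n\n";

# Hypothetical edit at byte 700: only the entry at 1041 moves (to 1050).
print shift_xref_offsets($xref, 700, $delta);
```

Each entry keeps its fixed ten-digit width, so the table itself does not change size when the offsets are rewritten.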

      Nice, thanks for the detailed answer. For the moment I will probably go with almut's suggestion, because I do not have a lot of time for this. However, I am really interested in learning this, and since I am going on vacation soon I will read up on it and hopefully later be able to design a script that interacts more the way I would like it to. Again, thanks for the answer, it really helped.
Re: Editing/Replacing Text in a PDF
by almut (Canon) on Dec 20, 2006 at 19:56 UTC

    As suggested by others, CAM::PDF is probably the better option for what you want to do (you should be able to do it with PDF::API2, too, but it's likely somewhat more involved...). CAM::PDF lets you easily edit content streams and stuff, and, most importantly, takes care of everything you don't want to do manually, like adjusting the object crossreference table after having modified an object's size (as explained by Russ), (un)compressing the streams, etc.

    Let's say you have the text "some placeholder text" (among other things) written on page 1 at some position, and you want to replace that text. In that case you could try:

    use CAM::PDF;

    my $pdf = CAM::PDF->new('Test1.pdf'); # existing document
    my $page = $pdf->getPageContent(1);
    # $page now holds the uncompressed page content as a string

    # replace the text part
    $page =~ s/some placeholder text/my new text/;

    $pdf->setPageContent(1, $page);
    $pdf->cleanoutput('Test2.pdf');
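One thing to watch when the new text comes from user input (the Flash variables, in this case): inside a PDF literal string, backslashes and parentheses should be escaped, or an unbalanced parenthesis will break the content stream. Below is a minimal sketch of that step, working on a plain string so it runs without CAM::PDF. The helper names `pdf_escape` and `replace_placeholder` are invented for this example, and it assumes the placeholder appears as one whole literal string.

```perl
use strict;
use warnings;

# Escape the characters that are special inside a PDF literal string.
sub pdf_escape {
    my ($text) = @_;
    $text =~ s/([\\()])/\\$1/g;
    return $text;
}

# Replace a placeholder that appears as a whole literal string "(...)".
sub replace_placeholder {
    my ($content, $placeholder, $new_text) = @_;
    my $replacement = pdf_escape($new_text);
    $content =~ s/\(\Q$placeholder\E\)/($replacement)/g;
    return $content;
}

# A made-up fragment of page content, in the form getPageContent returns.
my $content = "BT /F1 18 Tf 50 700 Td (some placeholder text) Tj ET";
print replace_placeholder($content, 'some placeholder text', 'Sales (2001)'), "\n";
```

With CAM::PDF, the string returned by `getPageContent` would take the place of `$content`, and the result would go back in through `setPageContent`.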

    The page content you get is in raw PDF syntax, i.e. various PDF operators like BT, ET, Tf, Tm, TJ, etc. together with their parameters. Consult Adobe's PDF Reference Document for what they do, their syntax, and so on.

    Literal text strings are written in parentheses (like in PostScript), so that's what you have to look for. Things may be complicated somewhat by the fact that some PDF generators are splitting up text in order to position individual substrings, to achieve custom word/letter spacing (i.e. differing from the font's defaults), kerning, etc... In the worst case, you'll find characters individually wrapped in parentheses.
    Well, just give it a try... it might not be as bad after all.
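Before doing the substitution, it can help to see how the generator actually split the text. The sketch below lists the text shown by each Tj or TJ operator in a content stream, joining the fragments of a TJ array and skipping the kerning numbers between them. It is a simplification: it handles only backslash-escaped parentheses and backslashes (no octal escapes, hex strings, or nested parentheses), it assumes no `]` occurs inside a string, and `shown_text` is a name made up for this example.

```perl
use strict;
use warnings;

# A PDF literal string: "(...)" with backslash escapes, no nested parens.
my $STRING = qr/\((?:\\.|[^\\()])*\)/s;

# Return the text shown by each Tj/TJ operator, one list element per operator.
sub shown_text {
    my ($content) = @_;
    my @runs;
    while ($content =~ /(\[[^\]]*\])\s*TJ|($STRING)\s*Tj/g) {
        my $operand = defined $1 ? $1 : $2;
        my @parts = $operand =~ /($STRING)/g;
        for (@parts) {
            s/^\(//;                 # drop the opening parenthesis
            s/\)$//;                 # drop the closing parenthesis
            s/\\([\\()])/$1/g;       # undo \( \) \\ escapes
        }
        push @runs, join('', @parts);
    }
    return @runs;
}

# Made-up page content: one string split for kerning, one left whole.
my $content = 'BT /F1 12 Tf [(some pla) -20 (ceholder) 15 ( text)] TJ ET '
            . 'BT /F1 18 Tf (Title) Tj ET';
print "$_\n" for shown_text($content);
```

If the placeholder shows up here as a single run but is split in the raw stream, a plain s/// on the raw content will not match it, and the whole TJ array has to be replaced instead.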

    Good luck!

      I don't think this is the first time you have helped me out a lot. I appreciate your patience with me. This definitely put me on the track I wanted to be on, but I think I will eventually learn more about what Russ was talking about. Again, thanks, I do appreciate it.
