PerlMonks  

Re^2: Calculated position incorrect when using regex in text file that also contains binary info

by geertvc (Sexton)
on Jun 17, 2020 at 05:09 UTC (#11118164)


in reply to Re: Calculated position incorrect when using regex in text file that also contains binary info
in thread Calculated position incorrect when using regex in text file that also contains binary info

Hi jcb,

You're absolutely correct. At the end of a PDF file, the trailer contains an entry of the form /Size <nr_of_cos_objects> that tells you how many x 0 obj objects there are in the PDF file. COS stands for Carousel Object System and refers to the original code used by Adobe (not used as such anymore, though...).

Just before that section, there's a "table" that tells you how many bytes (= offset) there are from the beginning of the file to a certain block, like so:

endobj
xref
0 7139
0000000000 65535 f
0000000015 00000 n
0000012681 00000 n
0000025600 00000 n
0000058867 00000 n
0050527288 00000 n
0000023513 00000 n
0000020738 00000 n
0000018831 00000 n
0000016437 00000 n
0000012809 00000 n
0050520688 00000 n
0050527008 00000 n
0000000484 00000 n

What you see here is the line 0 7139, which gives the first object number (0) and the number of entries (7139 in this case, which is repeated later on in a separate section indicated by /Size, as I explained above), followed by the "table" that gives the offset of every block, starting from entry 0. Entry 0 is marked f (free) and can and should be ignored, since no COS object is numbered 0.

So, object 1 0 obj is located at an offset of 15 bytes from the start of the file, 2 0 obj at an offset of 12681 bytes from the start of the file, and so on. Well, you get the picture...
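
To make the layout of that table concrete, here is a minimal Python sketch that reads such an xref section into a dictionary. The function name parse_xref_section and the four-entry sample are made up for illustration, and the sketch assumes a single subsection (one "first count" line), which is what the excerpt above shows.

```python
def parse_xref_section(text):
    """Parse a classic 'xref' section with a single subsection.

    Returns {object_number: (offset, generation, in_use)}.
    Hypothetical helper for illustration only.
    """
    tokens = text.split()
    start = tokens.index("xref") + 1
    # The line after 'xref' holds the first object number and the entry count.
    first, count = int(tokens[start]), int(tokens[start + 1])
    entries = {}
    pos = start + 2
    for obj_num in range(first, first + count):
        offset, gen, flag = tokens[pos:pos + 3]
        # 'n' marks an object that is in use, 'f' marks a free entry.
        entries[obj_num] = (int(offset), int(gen), flag == "n")
        pos += 3
    return entries


# Shortened sample using the first offsets quoted above.
sample = """xref
0 4
0000000000 65535 f
0000000015 00000 n
0000012681 00000 n
0000025600 00000 n
"""
table = parse_xref_section(sample)
```

With this sample, table[1] holds the offset 15 for object 1 0 obj and table[0] is the free entry that gets ignored.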

If I now change the content of the PDF (that is, remove such COS sections), the table is no longer correct. When you open such a file, your PDF reader will (and should) complain that the xref table is wrong.
Then you have 2 options:
1. Leave it as it is and save the file. I think a good PDF reader might resolve/recalculate the table for you, prior to saving the file. However, I'm not sure of that.
2. Recalculate the new table yourself before you share the document with someone else.

Since I don't want to bother that "someone else" with error messages when opening the modified PDF file, I chose the latter solution: recalculating the table myself.

This is why I want to recalculate the offset of each and every object in the PDF file after modifying the content of the file.
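
Once the new offsets are known, writing the table back is mostly a formatting job. The Python sketch below is one possible way to do it, not the method used in this thread. build_xref_section is a hypothetical name, and the sketch assumes every remaining object has generation 0 and that removed object numbers get free entries with generation 1. Each entry is written as exactly 20 bytes: a 10-digit offset, a space, a 5-digit generation, a space, the n/f flag and a two-byte line ending.

```python
def build_xref_section(offsets, size=None):
    """Format a single-subsection xref table from {object_number: byte_offset}.

    Assumptions (not from the thread): in-use objects have generation 0,
    removed objects become free entries with generation 1.
    """
    if size is None:
        size = max(offsets) + 1
    # Free entries form a linked list that starts and ends at object 0.
    free = [n for n in range(1, size) if n not in offsets]
    chain = [0] + free + [0]
    next_free = {chain[i]: chain[i + 1] for i in range(len(chain) - 1)}

    out = [b"xref\n", b"0 %d\n" % size]
    for num in range(size):
        if num == 0:
            out.append(b"%010d 65535 f \n" % next_free[0])
        elif num in offsets:
            out.append(b"%010d 00000 n \n" % offsets[num])
        else:
            out.append(b"%010d 00001 f \n" % next_free[num])
    return b"".join(out)


xref_full = build_xref_section({1: 15, 2: 12681})
xref_gap = build_xref_section({1: 15, 3: 100})  # object 2 was removed
```

If the removed objects were the highest-numbered ones, pass size explicitly so that the table still covers the original range.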

Your remark that a binary stream may contain a byte sequence matching the beginning of an object is correct, which is why my regex forces this byte sequence to be found at the beginning of a line. And I know, there's still a chance that a line inside a binary stream starts with such a byte sequence, but don't you think the odds of that are small?
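
For reference, this is roughly what such a line-anchored scan looks like in Python. It is only a sketch: the pattern and the name find_object_offsets are my own illustration, not the regex from my script. The important detail is that the data is handled as raw bytes (read the file with mode "rb"), so that match positions are byte offsets and not character offsets. The lookbehind accepts both \r and \n as line endings.

```python
import re

# An object header at the very start of the data or right after a line ending.
OBJ_HEADER = re.compile(rb"(?:\A|(?<=[\r\n]))(\d+) (\d+) obj\b")


def find_object_offsets(data: bytes):
    """Return {object_number: byte_offset} for every 'N G obj' at a line start.

    Illustrative only: a binary stream that happens to contain a matching
    line would still produce a false positive.
    """
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(data)}


# Tiny made-up file with non-text bytes in the second line.
data = (
    b"%PDF-1.4\n"
    b"%\xe2\xe3\xcf\xd3\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)
found = find_object_offsets(data)
```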

Anyway, when using Python, I can recalculate the table perfectly. Even with the binary content in the file, it works flawlessly.

One big disadvantage: it takes ages before the recalculation is finished, especially when there are several thousand COS items to recalculate. Since Perl is generally very good at text processing, I expect it to do the recalculation much faster. Hence, I chose to use Perl to help me do the job...


Best rgds,
Geert

Replies are listed 'Best First'.
Re^3: Calculated position incorrect when using regex in text file that also contains binary info
by jcb (Parson) on Jun 17, 2020 at 23:40 UTC

    Have you considered reading the xref table before beginning your manipulations, using it to calculate the sizes of the objects, reading the objects as binary records (set $/ to a reference to a number or use read) using the xref information, and then simply calculating and writing a new xref table? That should be faster still than asking the regex engine to scan the entire contents of a PDF.

    my regex forces this byte sequence to be found at the beginning of a line

    A binary stream can also contain an end-of-line sequence, especially if we consider maliciously crafted input.
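
The xref-driven approach suggested above could look roughly like the following Python sketch (the Perl version would use read or a record-sized $/ in the same way). The names split_objects and rebuild_without are hypothetical. The sketch assumes the objects lie back to back between the first object and the xref keyword, which real files do not guarantee.

```python
def split_objects(data: bytes, offsets: dict, xref_pos: int):
    """Slice raw object records out of data using the old xref offsets.

    Assumes each object runs up to the next object's offset (or to xref_pos).
    """
    order = sorted(offsets.items(), key=lambda kv: kv[1])
    bounds = [off for _, off in order] + [xref_pos]
    return {num: data[bounds[i]:bounds[i + 1]] for i, (num, _) in enumerate(order)}


def rebuild_without(data: bytes, offsets: dict, xref_pos: int, drop: set):
    """Copy the header and all kept objects, recording their new offsets."""
    records = split_objects(data, offsets, xref_pos)
    out = bytearray(data[:min(offsets.values())])  # everything before the first object
    new_offsets = {}
    for num in sorted(records, key=offsets.get):
        if num in drop:
            continue
        new_offsets[num] = len(out)  # byte position where this record now starts
        out += records[num]
    return bytes(out), new_offsets


# Made-up three-object body; offsets stand in for the parsed xref table.
data = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< >>\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
    b"3 0 obj\n<< >>\nendobj\n"
    b"xref\n"
)
offsets = {n: data.index(b"%d 0 obj" % n) for n in (1, 2, 3)}
new_data, new_offsets = rebuild_without(data, offsets, data.index(b"xref"), drop={2})
```

No regex touches the stream contents here, so bytes inside a stream cannot be mistaken for an object header.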
