Re^2: Calculated position incorrect when using regex in text file that also contains binary info
by geertvc (Sexton) on Jun 17, 2020 at 05:09 UTC
You're absolutely correct. At the end of a PDF file there's a /Size <nr_of_cos_objects> entry that tells you how many x 0 obj objects the PDF file contains (strictly speaking, it counts the entries in the xref table, including entry 0). COS stands for Carousel Object System and refers to the original code used by Adobe (no longer used as such, though...).
What you see here is the total number of COS objects (7139 in this case, which is repeated later on in the /Size entry I explained above), followed by the "table" that gives the offset of every object, starting from 0 (this entry can and should be ignored, since there is no COS object numbered 0).
So, object 1 0 obj is located at an offset of 15 bytes from the start of the file, 2 0 obj at an offset of 12681 bytes from the start of the file, and so on. Well, you get the picture...
If I now change the content of the PDF (that is, remove some of these COS objects), the table is no longer correct. When you open such a file, your PDF reader will (and should) complain that the xref table is invalid (obviously).
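To make the layout of that table concrete, here is a minimal Python sketch that reads a classic xref section and maps each object number to its byte offset. The sample section is hand-made for illustration (only the offsets 15 and 12681 come from the description above), and the names XREF, ENTRY and parse_xref are hypothetical. It assumes a single subsection and fixed-width entries (10-digit offset, 5-digit generation number, then n for in-use or f for free).

```python
import re

# Hand-made xref section for illustration; a real one would list all entries.
XREF = (
    b"xref\n"
    b"0 3\n"                    # first object number, number of entries
    b"0000000000 65535 f \n"    # entry 0: head of the free list, ignored
    b"0000000015 00000 n \n"    # object 1 starts at byte 15
    b"0000012681 00000 n \n"    # object 2 starts at byte 12681
)

# One entry: 10-digit offset, 5-digit generation, 'n' (in use) or 'f' (free).
ENTRY = re.compile(rb"(\d{10}) (\d{5}) ([nf])")

def parse_xref(section: bytes) -> dict:
    """Return {object number: byte offset} for the in-use entries."""
    lines = section.splitlines()
    first, count = (int(x) for x in lines[1].split())
    offsets = {}
    for obj_num, line in enumerate(lines[2:2 + count], start=first):
        m = ENTRY.match(line)
        if m and m.group(3) == b"n":
            offsets[obj_num] = int(m.group(1))
    return offsets

offsets = parse_xref(XREF)
```

Entry 0 is marked f, so it never ends up in the result, which matches the advice to ignore it.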
Then you have 2 options:
1. Leave it as it is and save the file. I think a good PDF reader might resolve/recalculate the table for you, prior to saving the file. However, I'm not sure of that.
2. Recalculate the new table yourself before you share the document with someone else.
Since I don't want to bother that "someone else" with error messages when opening the modified PDF file, I chose the latter option: recalculating the table myself.
That is why I want to recalculate the offset of each and every object in the PDF file after modifying its content.
Your remark that a binary stream may contain a byte sequence matching the beginning of an object is correct, but that is why my regex requires this byte sequence to appear at the beginning of a line. I know there's still a chance that a line inside a binary stream starts with such a byte sequence, but don't you think the odds of that are small?
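The line-anchored search can be sketched as follows in Python. This is an illustration under stated assumptions, not the actual script: the sample data is made up, the names OBJ_HEADER and find_object_offsets are hypothetical, and the pattern assumes single spaces between the object number, the generation number and obj. The data is kept as bytes, so every match position is a byte offset, and the lookbehind accepts both CR and LF as a line ending.

```python
import re

# "N G obj" at the very start of the data or directly after a CR or LF.
# A bytes pattern on bytes data means m.start() is a byte offset.
OBJ_HEADER = re.compile(rb"(?:^|(?<=[\r\n]))(\d+) (\d+) obj\b")

def find_object_offsets(data: bytes) -> dict:
    """Return {object number: byte offset of its 'N G obj' header}."""
    return {int(m.group(1)): m.start() for m in OBJ_HEADER.finditer(data)}

# Made-up sample: the stream of object 1 contains "2 0 obj" in mid-line,
# preceded by two binary bytes, so the anchored pattern skips it.
sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Length 9 >>\nstream\n\x00\xff2 0 obj\nendstream\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)

found = find_object_offsets(sample)
```

The look-alike inside the stream is ignored only because it does not start a line. If stream data happened to contain a line break directly followed by such a sequence, this sketch would report a false hit, which is exactly the residual risk discussed above.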
Anyway, when using Python, I can recalculate the table perfectly. Even with the binary content in the file, it works flawlessly.
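For illustration only (this is not the script referred to above), writing the table back out from a set of recalculated offsets could look roughly like this. The function name build_xref is hypothetical, all generation numbers are assumed to be 0, and removed object numbers are simply marked as free without maintaining the linked list of free entries that the PDF specification describes.

```python
def build_xref(offsets: dict) -> bytes:
    """Build a classic single-subsection xref table from {obj number: offset}."""
    size = max(offsets) + 1                   # entries 0 .. highest object number
    lines = [b"xref", b"0 %d" % size, b"0000000000 65535 f "]
    for num in range(1, size):
        if num in offsets:
            # In-use entry: 10-digit offset, generation 0 (assumed).
            lines.append(b"%010d 00000 n " % offsets[num])
        else:
            # Simplification: a removed object becomes a bare free entry.
            lines.append(b"0000000000 65535 f ")
    return b"\n".join(lines) + b"\n"

table = build_xref({1: 15, 2: 12681})
```

Each entry line is exactly 20 bytes including its line ending. The /Size value and the startxref pointer at the end of the file would have to be updated to match; that part is left out here.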
One big disadvantage: it takes ages before the recalculation is finished, especially when there are several thousand COS objects to recalculate. Knowing that Perl is very good at text processing, I expect it to do the recalculation much faster. Hence, I chose Perl to help me do the job...