Hi jcb,

You're absolutely correct. Near the end of a PDF file, the trailer dictionary contains a /Size <nr_of_entries> entry that tells you how many entries the cross-reference table has. That is one more than the highest x 0 obj number in the PDF file, because entry 0 counts too. COS = Carousel Object System and refers to the original code used by Adobe (not used anymore as such, though...).

Just before that section, there's a "table" that tells you how many bytes (= offset) there are from the beginning of the file to a certain block, like so:

    endobj
    xref
    0 7139
    0000000000 65535 f
    0000000015 00000 n
    0000012681 00000 n
    0000025600 00000 n
    0000058867 00000 n
    0050527288 00000 n
    0000023513 00000 n
    0000020738 00000 n
    0000018831 00000 n
    0000016437 00000 n
    0000012809 00000 n
    0050520688 00000 n
    0050527008 00000 n
    0000000484 00000 n

The line 0 7139 gives the number of the first object in the table (0) and the total number of entries (7139 in this case, which is repeated later on in the /Size entry, like I explained above). After that comes the "table" itself, one line per object, starting from object 0. Each line holds a 10-digit byte offset, a 5-digit generation number and a flag: n for an object in use, f for a free one. Entry 0 is always free and can be ignored when looking up offsets, since there's no 0 0 obj in the file.

So, object 1 0 obj is located at an offset of 15 bytes from the start of the file, 2 0 obj at an offset of 12681 bytes, and so on. Well, you get the picture...
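
To make the layout of the table concrete, here is a minimal Python sketch that reads such a section back into an "object number -> offset" mapping. The names parse_xref_section and ENTRY_RE are made up for this illustration, and the sketch assumes a classic, uncompressed xref table with a single subsection, which is what the excerpt above looks like.

```python
import re

# One classic xref entry: 10-digit byte offset, 5-digit generation,
# then 'n' (object in use) or 'f' (free entry).
ENTRY_RE = re.compile(rb"(\d{10}) (\d{5}) ([nf])")

def parse_xref_section(data: bytes) -> dict:
    """Map object number -> byte offset for the in-use entries.

    Assumption: one xref section with a single subsection header
    ("<first object> <count>"), as in the excerpt above.
    """
    header = re.search(rb"xref\s+(\d+)\s+(\d+)\s+", data)
    if header is None:
        raise ValueError("no xref section found")
    first_obj, count = int(header.group(1)), int(header.group(2))
    offsets = {}
    entries = ENTRY_RE.finditer(data, header.end())
    for obj_num, entry in zip(range(first_obj, first_obj + count), entries):
        if entry.group(3) == b"n":  # skip free entries such as entry 0
            offsets[obj_num] = int(entry.group(1))
    return offsets

# The first four entries of the table shown above.
sample = (
    b"endobj\nxref\n0 4\n"
    b"0000000000 65535 f \n"
    b"0000000015 00000 n \n"
    b"0000012681 00000 n \n"
    b"0000025600 00000 n \n"
)
xref_offsets = parse_xref_section(sample)
print(xref_offsets)  # {1: 15, 2: 12681, 3: 25600}
```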

If I now change the content of the PDF (that is, remove some of these COS objects), every object after a removed one moves and the table is no longer correct. When you open such a file, your PDF reader will (and should) complain that the xref table is invalid.
Then you have 2 options:
1. Leave it as it is and save the file. I think a good PDF reader might recalculate the table for you before saving the file. However, I'm not sure of that.
2. Recalculate the new table yourself before you share the document with someone else.

Since I don't want to bother that "someone else" with error messages when opening the modified PDF file, I chose the latter: recalculating the table myself.

That is why, after modifying the content of the file, I want to recalculate the offset of each and every object in the PDF file.
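
The recalculation described above could be sketched in Python roughly as follows. This is not the actual script from this thread, only an illustration under stated assumptions: build_xref and OBJ_RE are hypothetical names, the whole file is read as bytes so that match positions are byte offsets, and objects that are missing from the file are simply flagged as free without building the proper linked free list.

```python
import re

# An object header such as b"12 0 obj" at the start of a line (after CR or LF)
# or at the very start of the data. It is a bytes pattern, so m.start() is a
# byte offset, not a character offset.
OBJ_RE = re.compile(rb"(?:(?<=[\r\n])|\A)(\d+) (\d+) obj\b")

def build_xref(data: bytes) -> bytes:
    """Rebuild a classic xref table from the object headers found in data."""
    found = {int(m.group(1)): (m.start(), int(m.group(2)))
             for m in OBJ_RE.finditer(data)}
    size = max(found, default=0) + 1  # entry 0 counts too
    lines = [b"xref", b"0 %d" % size, b"0000000000 65535 f "]
    for num in range(1, size):
        if num in found:
            offset, gen = found[num]
            lines.append(b"%010d %05d n " % (offset, gen))
        else:
            # Simplification (assumption): a removed object becomes a free
            # entry, but the free list is not chained as the format expects.
            lines.append(b"0000000000 00000 f ")
    return b"\n".join(lines) + b"\n"

# A tiny made-up body with a binary stream in object 2.
body = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Length 4 >>\nstream\n\x00\xff\xfe\x80\nendstream\nendobj\n"
    b"3 0 obj\n<< >>\nendobj\n"
)
xref_table = build_xref(body)
print(xref_table.decode("ascii"))
```

A complete fix would also have to update the /Size entry in the trailer and the byte offset written after startxref, which this sketch leaves out.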

Your remark about a binary stream containing a byte sequence that matches the beginning of an object is correct, which is why my regex forces this byte sequence to be found at the beginning of a line. And I know, there's still a chance that a line inside a binary stream starts with such a byte sequence, but don't you think the chances of that are slim?
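
For what it's worth, the remaining false-positive case can be shown with a small Python sketch. The technique here is an alternative to the plain line-anchored regex, not what my script does: one combined pattern first consumes everything between stream and endstream, so a fake header inside the binary data is never seen as an object. TOKEN_RE and object_offsets are hypothetical names, and the sketch assumes the stream data itself does not contain the word endstream (a robust tool would use the /Length value instead).

```python
import re

TOKEN_RE = re.compile(
    rb"\bstream\r?\n.*?endstream"              # a whole stream body, consumed and ignored
    rb"|(?:(?<=[\r\n])|\A)(\d+) (\d+) obj\b",  # or an object header at the start of a line
    re.DOTALL,
)

def object_offsets(data: bytes) -> dict:
    """Map object number -> byte offset, skipping anything inside a stream."""
    offsets = {}
    for m in TOKEN_RE.finditer(data):
        if m.group(1) is not None:  # group 1 is only set for object headers
            offsets[int(m.group(1))] = m.start()
    return offsets

# Made-up data: the stream of object 1 happens to contain a line "7 0 obj".
data = (
    b"1 0 obj\n<< /Length 12 >>\nstream\n"
    b"\x00\xff\n7 0 obj\n\x80\nendstream\nendobj\n"
    b"2 0 obj\n<< >>\nendobj\n"
)

# The plain line-anchored regex also reports the fake header.
naive_hits = len(re.findall(rb"(?m)^\d+ \d+ obj\b", data))
offsets = object_offsets(data)
print(naive_hits, sorted(offsets))
```

Because the whole file is scanned in a single pass, this should not cost much more than the plain regex.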

Anyway, when using Python, I can recalculate the table perfectly. Even with the binary content in the file, it works flawlessly.

One big disadvantage: it takes ages and ages before the recalculation is finished, especially when you have several thousands of COS objects to recalculate. Since Perl is known to be very good at text processing, I expect it to do the recalculation much faster. Hence, I chose Perl to help me do the job...


Best rgds,
Geert

In reply to Re^2: Calculated position incorrect when using regex in text file that also contains binary info by geertvc
in thread Calculated position incorrect when using regex in text file that also contains binary info by geertvc
