http://www.perlmonks.org?node_id=11118037

geertvc has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have an uncompressed PDF file. I'm trying to find all locations/positions that contain the following format: <digit> 0 obj at the beginning of a line. E.g.: 3 0 obj.
Once I have found an occurrence, I want to know how many bytes lie between the beginning of the file and that position.

The file also contains, next to regular ASCII characters, binary data. A snippet is shown here below:

/CapHeight 667 >> endobj 1 0 obj << /Subtype /Type1C /Length 1194 2 0 obj >> stream 3 0 obj   BXTBMO+URWTypewriterTOT-LigNar % N% Lm /> 7D(URW)++,Copyright 2003 by (URW)++ Design & Developmen +t/FSType 4 def - B F H J O S    r SOb‹ո +K{zw }}}›{p{|}›}}}X{|}› +M}g4z|k}΃rm\\H +^f|oaviZb{~›ȔvtuJ‹Y<Z{hYso^P4 +UἬڬ.!e~L~u^r^sPDaį }VANr› +oH+8..-/›}wH{Ue?T^lrz +|pl^dI# o.uovat_wJ>Y{{~~zujvo~ۻ  +|3{zI\\sO8=1\'#-¹-|l +gfAGfsken^uZ+J ‹K.Lz4|~ +}}}e|~6|k}luyyuuxyt‹ +o/Lz3|~}}}d|~= +|e}iٵ_9}}e|~=|d}S +%>TfIu ‹o/Lz3|~}}}d|~ +M|U}\\ۥº0‹‹›}rec{rqyy{i +wdK Ò \\ endstream endobj 4 0 obj << /Length 422 >>

To find the locations in the above code snippet, I'm using a regex like so: qr/^\d+ 0 obj/m.

This is the test code I'm using ($pdf contains the string and \d is replaced with a fixed number as a test):

my $result = $pdf=~qr/^1 0 obj/m;
say "Finding first item at start position [$-[0]]" if $result;
say "Finding first item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^2 0 obj/m;
say "Finding second item at start position [$-[0]]" if $result;
say "Finding second item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^3 0 obj/m;
say "Finding third item at start position [$-[0]]" if $result;
say "Finding third item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^4 0 obj/m;
say "Finding fourth item at start position [$-[0]]" if $result;
say "Finding fourth item at start position [$+[0]]\n" if $result;
This results in the following output:
Finding first item at start position [26]
Finding first item at start position [33]
Finding second item at start position [68]
Finding second item at start position [75]
Finding third item at start position [87]
Finding third item at start position [94]
Finding fourth item at start position [2035]
Finding fourth item at start position [2042]

The first 3 results are correct, the 4th result is not. Reason: the first 3 results were found in "pure ASCII" text while the 4th match (4 0 obj) is in a section that's behind the section that contains the binary content. The result should be 1315 and 1322 instead of 2035 and 2042.

I'm afraid there's a problem with my regex when the text file also contains binary data.
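The regex itself counts positions in whatever string it is given, so the same data can yield different offsets depending on how the string was built. The following is a minimal sketch of that effect and not a diagnosis of the file above; the sample data and the helper name header_offset are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two UTF-8 encoded "e acute" characters (4 bytes), a newline, then an object header.
my $bytes = "\xC3\xA9\xC3\xA9\n4 0 obj\n";

# Returns the offset of the first "4 0 obj" at a line start, or -1.
sub header_offset {
    my ($string) = @_;
    return $string =~ /^4 0 obj/m ? $-[0] : -1;
}

my $as_bytes   = header_offset($bytes);                   # one position per byte
my $as_chars   = header_offset(decode('UTF-8', $bytes));  # each 2-byte sequence becomes one character
my $re_encoded = header_offset(encode('UTF-8', $bytes));  # each high byte becomes two bytes

print "bytes: $as_bytes, decoded: $as_chars, re-encoded: $re_encoded\n";
```

This prints "bytes: 5, decoded: 3, re-encoded: 9". An offset that comes out too large, as in the fourth result above, would be consistent with the data having been encoded once more somewhere along the way, but that is only a guess without seeing how $pdf was filled.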

In Python, you can tell the regex to "behave" binary by adding a b in front of the regex, like so: re.search(b'^\d+ 0 obj'... and that works perfectly. The locations found are spot on.

How can this be done in Perl?

Best,
Geert

Replies are listed 'Best First'.
Re: Calculated position incorrect when using regex in text file that also contains binary info (updated)
by haukex (Bishop) on Jun 14, 2020 at 09:11 UTC

    Just two quick thoughts for now: open the file with the mode '<:raw' (or binmode the handle after opening but before reading), and add the /aa regex modifier (perlre) to your regexen.

    Update: I didn't have enough time earlier to explain why I made these two suggestions, so let me do that now. First, note that on Windows, the :crlf PerlIO layer is active by default, translating CRLFs to LFs, which isn't good for binary data. Also, opens without explicit layers can be affected by the open pragma. Second, if you were to open the file with an encoding layer, under Unicode matching rules (see /u in perlre), \d matches any Unicode digits (see "Digits" in perlrecharclass).
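As a rough sketch of the two suggestions combined (read through '<:raw', match with /aa plus /m and /g), something like the following could collect every object header together with its byte offset. The sample data, the temporary file and the variable names are assumptions made up for this illustration; /a and /aa need Perl 5.14 or later.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for a PDF: ASCII object headers around high bytes.
my $sample = "%PDF-1.4\n1 0 obj\n<< /Length 4 >>\nstream\n"
           . "\xE2\x80\xB9\xFF\nendstream\nendobj\n2 0 obj\n<< >>\nendobj\n";

my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $sample;
close $out or die "close: $!";

# Read the file back as raw bytes: no CRLF translation, no decoding.
open my $in, '<:raw', $filename or die "open $filename: $!";
my $bytes = do { local $/; <$in> };
close $in;

# /m lets ^ match at every line start, /aa restricts \d to ASCII 0-9,
# /g walks through all matches; $-[0] is the start of the current match.
my @offsets;
while ($bytes =~ /^(\d+) 0 obj/maag) {
    push @offsets, [ $1, $-[0] ];
}

printf "%d 0 obj at byte offset %d\n", @$_ for @offsets;
```

For this sample the headers are reported at byte offsets 9 and 62, which is what counting the bytes by hand gives.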

    Update 2: Typo fix

      Hello haukex,

      Sorry for the late reply, but I wasn't able to work on this for a few days.

      I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour?

      Anyway, I followed your advice and came up with the following test program:

      use v5.010;
      use strict;
      use warnings;

      # Reading in file in raw format.
      local $/;
      open F, "<:raw", "input.pdf" or die $!;
      my $raw_content = <F>;

      my $nr_of_cos_objects = 10;
      my @counter = (1..$nr_of_cos_objects);
      my $position = 0;

      for my $number (@counter) {
          my $result = $raw_content=~qr/^${number} 0 obj/aa;
          if ($result) {
              say "Object item [$number] ('${number} 0 obj') starts at position [$-[0]]";
          } else {
              say "Object item [$number] ('${number} 0 obj') start position not found";
          }
          if ($result) {
              say "Object item [$number] ('${number} 0 obj') ends at position [$+[0]]\n";
          } else {
              say "Object item [$number] ('${number} 0 obj') end position not found\n";
          }
      }
      say "End test program. Bye...";
      Result: nothing is found at all when using /aa (or /a). This is the output:
      Object item [1] ('1 0 obj') start position not found
      Object item [1] ('1 0 obj') end position not found
      Object item [2] ('2 0 obj') start position not found
      Object item [2] ('2 0 obj') end position not found
      Object item [3] ('3 0 obj') start position not found
      Object item [3] ('3 0 obj') end position not found
      Object item [4] ('4 0 obj') start position not found
      Object item [4] ('4 0 obj') end position not found
      Object item [5] ('5 0 obj') start position not found
      Object item [5] ('5 0 obj') end position not found
      Object item [6] ('6 0 obj') start position not found
      Object item [6] ('6 0 obj') end position not found
      Object item [7] ('7 0 obj') start position not found
      Object item [7] ('7 0 obj') end position not found
      Object item [8] ('8 0 obj') start position not found
      Object item [8] ('8 0 obj') end position not found
      Object item [9] ('9 0 obj') start position not found
      Object item [9] ('9 0 obj') end position not found
      Object item [10] ('10 0 obj') start position not found
      Object item [10] ('10 0 obj') end position not found

      The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations. When doing this, I get the results back, but incorrect (same results as my very initial attempts...).

      Object item [1] ('1 0 obj') starts at position [19]
      Object item [1] ('1 0 obj') ends at position [26]
      Object item [2] ('2 0 obj') starts at position [235]
      Object item [2] ('2 0 obj') ends at position [242]
      Object item [3] ('3 0 obj') starts at position [344]
      Object item [3] ('3 0 obj') ends at position [351]
      Object item [4] ('4 0 obj') starts at position [667]
      Object item [4] ('4 0 obj') ends at position [674]
      Object item [5] ('5 0 obj') starts at position [2663]
      Object item [5] ('5 0 obj') ends at position [2670]
      Object item [6] ('6 0 obj') starts at position [3139]
      Object item [6] ('6 0 obj') ends at position [3146]
      Object item [7] ('7 0 obj') starts at position [3514]
      Object item [7] ('7 0 obj') ends at position [3521]
      Object item [8] ('8 0 obj') starts at position [3839]
      Object item [8] ('8 0 obj') ends at position [3846]
      Object item [9] ('9 0 obj') starts at position [5063]
      Object item [9] ('9 0 obj') ends at position [5070]
      Object item [10] ('10 0 obj') starts at position [5501]
      Object item [10] ('10 0 obj') ends at position [5509]

      Best rgds,
      Geert
        When doing this, I get the results back, but incorrect (same results as my very initial attempts...)

        Show the xref-table fragment for objects 1-10 or, better yet, provide a link to the test file.

        Your approach to PDF hacking is seriously flawed; listen to what AM says and use a proper API (CAM::PDF). You don't need to manually touch, contract or edit the xref table after deleting an object, because that's done for you automatically; that's what the API is for. Just note that deleteObject is among the "deeper utilities" for a reason: one generally doesn't need to call it either, since unused objects will be cleansed for you automatically, too.

        Sorry, my reply is quite late as well.

        The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations.

        Yes, that's what I meant, sorry - often people will say e.g. "the /m and /s modifiers" to distinguish modifiers visually from m// and s/// or other things, but that doesn't mean to say those are the only modifiers that should be applied to the regex.

        I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour? Anyway, I followed your advice and came up with the following test program

        Yes, it's the same behavior. You didn't show that in your original node, and for this node you've shown what looks to be a complete script, but you're not showing your input (in a format compatible with text-only display, such as hexdump -C input.pdf or od -tx1c input.pdf) or your expected output, leaving us to guess what the issue is. This is why Short, Self-Contained, Correct Examples are so important, so that we can reproduce the issue. Please show: Short, representative sample input, a runnable piece of code, the expected output for the input, and the actual output, including any error messages.
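If neither hexdump nor od is at hand (on Windows, for instance), a small Perl routine can produce a comparable text-only view of the bytes. This is only a sketch; the function name hex_dump and its layout are made up here and merely resemble the output of hexdump -C.

```perl
use strict;
use warnings;

# Formats a byte string as offset, hex bytes and printable ASCII, 16 bytes per row.
sub hex_dump {
    my ($bytes) = @_;
    my @rows;
    for (my $pos = 0; $pos < length $bytes; $pos += 16) {
        my $chunk = substr $bytes, $pos, 16;
        my $hex   = join ' ', map { sprintf('%02x', ord($_)) } split //, $chunk;
        (my $text = $chunk) =~ s/[^\x20-\x7e]/./g;   # non-printable bytes shown as "."
        push @rows, sprintf('%08x  %-47s  |%s|', $pos, $hex, $text);
    }
    return join "\n", @rows;
}

my $dump = hex_dump("endobj\n4 0 obj\n\xFF\x00");
print $dump, "\n";
```

Each row starts with the offset of its first byte in hex, so the position of a header such as "4 0 obj" can be read straight off the dump and compared with what the regex reports.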

Re: Calculated position incorrect when using regex in text file that also contains binary info
by hippo (Bishop) on Jun 14, 2020 at 10:10 UTC

    Works fine for me:

    use strict; use warnings; use Test::More; my $pdf = <<'EOT'; /CapHeight 667 >> endobj 1 0 obj << /Subtype /Type1C /Length 1194 2 0 obj >> stream 3 0 obj BXTBMO+URWTypewriterTOT-LigNar % N% Lm /> 7D(URW)++,Copyright 2003 by (URW)++ Design & Development/FSType +4 def - B F H J O S r SOb&#49912;&#1400;K{zw }}}{p{|}}}}X{|}M}g4z|k}&#899;rm\\Hf|oaviZb +{~&#532;שvtuJY<Z{hYso^P4U&#7980;&#1708;.!e~L~u^r^sP a&#303; ..-/}wH{Ue?^lrz|pl^dI#+ o.uovat_wJY{~~zujvo~Ѹ |3{zI\\sO8=1\'#-- +ط|lgfAGfsken^uZ K.Lz4|~}}}e|~6|k}luyyuuxyto +/Lz3|~}}}d|~=|e}i&#1653;_9}}e|~=|d}S%>TfIu +o/Lz3|~}}}d|~M|U}\\&#1765;0}rec{rqyy{wdK \\ endstream endobj 4 0 obj << /Length 422 >> EOT my @tests = ( { no => 1, start => 26, end => 33 }, { no => 2, start => 68, end => 75 }, { no => 3, start => 87, end => 94 }, { no => 4, start => 1315, end => 1322 }, ); plan tests => 2 * @tests; for my $t (@tests) { my $result = $pdf =~ qr/^$t->{no} 0 obj/m; is $-[0], $t->{start}, "No $t->{no} first match starts at $t->{sta +rt}"; is $+[0], $t->{end}, "No $t->{no} first match ends at $t->{end}" +; }

    Perl 5.20.3 on Linux

      Hello hippo,

      I just tried your code on a Windows 10 machine and here it fails. Result:

      1..8
      ok 1 - No 1 first match starts at 26
      ok 2 - No 1 first match ends at 33
      ok 3 - No 2 first match starts at 68
      ok 4 - No 2 first match ends at 75
      ok 5 - No 3 first match starts at 87
      ok 6 - No 3 first match ends at 94
      not ok 7 - No 4 first match starts at 1315
      #   Failed test 'No 4 first match starts at 1315'
      #   at E:\AppData\Programming\Perl\ReadFile/hippo.pl line 50.
      #          got: '1139'
      #     expected: '1315'
      not ok 8 - No 4 first match ends at 1322
      #   Failed test 'No 4 first match ends at 1322'
      #   at E:\AppData\Programming\Perl\ReadFile/hippo.pl line 51.
      #          got: '1146'
      #     expected: '1322'
      Even the "got" values (calculated based on the Windows OS) are not correct when checking the location of the items in a text editor (I'm using NPP as text editor) in Windows. It should be 923 and 930 for the 4th object (as opposed to 1139 and 1146 in your script result).

      So, "a" conclusion might be that Windows (once again) differs from Linux in the way files are handled.
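One platform difference that is easy to reproduce anywhere is the :crlf PerlIO layer, which is the default on Windows and can be requested explicitly elsewhere. The sketch below uses a made-up three-line file and a hypothetical helper offset_with_layer; it shows the mechanism only and is not a confirmed explanation of the numbers above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# A file with Windows line endings: two short lines, then an object header.
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "ab\r\ncd\r\n4 0 obj\r\n";
close $out or die "close: $!";

# Reads the whole file through the given PerlIO layer and returns the
# offset of the object header within the resulting string.
sub offset_with_layer {
    my ($layer) = @_;
    open my $in, "<$layer", $filename or die "open: $!";
    my $content = do { local $/; <$in> };
    close $in;
    return $content =~ /^4 0 obj/m ? $-[0] : -1;
}

my $raw  = offset_with_layer(':raw');    # bytes exactly as stored on disk
my $crlf = offset_with_layer(':crlf');   # every CRLF pair collapsed to a single LF
print "raw: $raw, crlf: $crlf\n";
```

This prints "raw: 8, crlf: 6": the translated string is one character shorter per line ending, so every later offset drifts away from the true file position.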

      Would there be a need to open the file with a specific encoding?

      Best,
      Geert
        So, "a" conclusion might be that Windows (once again) differs from Linux in the way files are handled.

        Quite probably. Having no recent experience of MS Windows I'm not in a position to help you further, unfortunately. In your shoes I would try to persist with the SSCCE, however. The fact that it currently fails shows the problem in isolation which might help to solve it. Good luck.

        Hi

        The proper way to share binary data inside perl programs is to use Data::Dump::dd()

        This is especially true if you share that file on the internet

        Without this you have no basis for any conclusions
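Data::Dump is a CPAN module; where it isn't installed, the core module Data::Dumper with its Useqq option can serve a similar purpose. The following sketch swaps in that substitute (it is not the dd() call suggested above) and uses made-up sample bytes.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical sample: an object header followed by a few binary bytes.
my $bytes = "4 0 obj\n\xFF\x00\x8B";

my $literal = do {
    local $Data::Dumper::Useqq = 1;   # escape non-printable bytes inside a double-quoted string
    local $Data::Dumper::Terse = 1;   # emit just the literal, without "$VAR1 ="
    Dumper($bytes);
};
print $literal;

# Evaluating the literal gives back the identical byte string.
my $round_trip = eval $literal;
```

The printed literal is plain ASCII, so it survives being pasted into a post, and anyone can paste it into a script to get exactly the same bytes back.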

Re: Calculated position incorrect when using regex in text file that also contains binary info
by jcb (Parson) on Jun 15, 2020 at 02:10 UTC

    If I remember correctly, there is (supposed to be) a central table near the end of the PDF file that lists all of the actual objects in the file and their offsets. You will need to read the data from there because you have no guarantee that a binary stream will not happen to contain a byte sequence that looks like the beginning of an object, unless you can parse all of the objects in the PDF. This may seem spectacularly unlikely, but it will be a serious problem if you are handling untrusted and potentially malicious input.

      Hi jcb,

      You're absolutely correct. At the end of a PDF file, there's a size section indicated by /Size <nr_of_cos_objects> that informs you how many x 0 obj there are in the PDF file. COS = Carousel Object System and refers to the original code used by Adobe (not used anymore as such, though...).

      Just before that section, there's a "table" that tells you how many bytes (= offset) there are from the beginning of the file to a certain block, like so:

      endobj
      xref
      0 7139
      0000000000 65535 f
      0000000015 00000 n
      0000012681 00000 n
      0000025600 00000 n
      0000058867 00000 n
      0050527288 00000 n
      0000023513 00000 n
      0000020738 00000 n
      0000018831 00000 n
      0000016437 00000 n
      0000012809 00000 n
      0050520688 00000 n
      0050527008 00000 n
      0000000484 00000 n
      What you see here is the total amount of COS objects (7139 in this case, which is repeated later on within a separate section indicated with /Size, like I explained above), followed by the "table" that indicates the offset of every block, starting from 0 (this one can and should be ignored, since there's no COS starting with 0 itself).

      So, object 1 0 obj is located at a 15 bytes offset position from the start of the file, 2 0 obj has an offset of 12681 bytes from the start of the file and so on. Well, you get the picture...
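To make the layout concrete, here is a minimal sketch that turns such a table into a mapping from object number to byte offset. It assumes a classic, uncompressed xref section (not a cross-reference stream); the function name parse_xref is made up, and the sample is the first few entries of the table above with the count shortened to 4.

```perl
use strict;
use warnings;

# Parses a classic xref section into { object number => byte offset }.
sub parse_xref {
    my ($xref) = @_;
    my (%offset_of, $next_object);
    for my $line (split /\r?\n|\r/, $xref) {
        if ($line =~ /^(\d+) (\d+)\s*$/a) {
            $next_object = $1;                  # subsection header: first object number, count
        }
        elsif ($line =~ /^(\d{10}) (\d{5}) ([nf])\s*$/a) {
            next unless defined $next_object;
            $offset_of{$next_object} = $1 + 0 if $3 eq 'n';   # "f" marks a free entry
            $next_object++;
        }
    }
    return \%offset_of;
}

my $xref = join "\n",
    'xref',
    '0 4',
    '0000000000 65535 f ',
    '0000000015 00000 n ',
    '0000012681 00000 n ',
    '0000025600 00000 n ',
    '';
my $offsets = parse_xref($xref);
printf "%d 0 obj starts at byte %d\n", $_, $offsets->{$_}
    for sort { $a <=> $b } keys %$offsets;
```

Entry 0 is the free entry and is skipped, so the result starts with object 1 at offset 15 and object 2 at offset 12681, matching the reading given above.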

      If I now change the content of the PDF (that is, removing such COS sections), then the table is not correct anymore. When you open such files, your PDF reader will (and should) complain that the xref table is not correct anymore (obviously).
      Then you have 2 options:
      1. Leave it as it is and save the file. I think a good PDF reader might resolve/recalculate the table for you, prior to saving the file. However, I'm not sure of that.
      2. Recalculate the new table yourself before you share the document with someone else.

      Since I don't want to bother that "someone else" with error messages when opening the modified PDF file, I chose the latter solution: recalculating the table myself.

      This is why I want to recalculate the offset of each and every object in the PDF file after modifying its content.
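Once the new offsets are known, writing the table is mostly a formatting exercise: each entry is a 10-digit offset, a 5-digit generation number and a flag, 20 bytes in total including the line ending. The sketch below is an assumption-laden illustration: build_xref is a made-up name, it expects contiguous object numbers 1..N with generation 0, and it does not handle removed objects (which would need free entries) or the startxref value near the end of the file, which also has to point at the new position of the xref keyword.

```perl
use strict;
use warnings;

# Builds a classic xref section for objects 1..N from their byte offsets.
# $offset_of is a hash ref { object number => offset }; generation 0 is assumed.
sub build_xref {
    my ($offset_of) = @_;
    my @numbers = sort { $a <=> $b } keys %$offset_of;
    my $count   = $numbers[-1] + 1;              # entry 0 plus objects 1..max
    my $table   = "xref\n0 $count\n";
    $table .= "0000000000 65535 f \n";           # conventional head of the free list
    for my $number (1 .. $numbers[-1]) {
        die "no offset for object $number\n" unless exists $offset_of->{$number};
        $table .= sprintf "%010d %05d n \n", $offset_of->{$number}, 0;
    }
    return $table;
}

my $table = build_xref({ 1 => 15, 2 => 12681, 3 => 25600 });
print $table;
```

Fed with the first three offsets from the table shown earlier, this reproduces the same entries, which is a quick sanity check before running it on recalculated values.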

      Your remark about a binary stream containing a byte sequence that matches the beginning of an object is correct, but therefore my regex forces this byte sequence to be found at the beginning of a line. I know there's still a chance that a line inside a binary stream starts with such a byte sequence, but don't you think the chances are slim?

      Anyway, when using Python, I can recalculate the table perfectly. Even with the binary content in the file, it works flawlessly.

      One big disadvantage: the recalculation takes ages, especially when there are several thousand COS items to process. Since Perl is known to be strong at text processing, I expect it to do the recalculation much faster. Hence, I chose Perl to help me do the job...


      Best rgds,
      Geert

        Have you considered reading the xref table before beginning your manipulations, using it to calculate the sizes of the objects, reading the objects as binary records (set $/ to a reference to a number or use read) using the xref information, and then simply calculating and writing a new xref table? That should be faster still than asking the regex engine to scan the entire contents of a PDF.
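A minimal sketch of that idea, using seek and read on a raw handle, might look as follows. It assumes the objects lie back to back, so that an object's size is the distance to the next offset; the function name read_objects, the miniature file and its offsets are all made up for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Reads each object as a byte record, given { object number => offset } and
# the offset where the object area ends (for example, where "xref" starts).
sub read_objects {
    my ($fh, $offset_of, $end_offset) = @_;
    my @by_offset = sort { $offset_of->{$a} <=> $offset_of->{$b} } keys %$offset_of;
    my %record_of;
    for my $i (0 .. $#by_offset) {
        my $start = $offset_of->{ $by_offset[$i] };
        my $stop  = $i < $#by_offset ? $offset_of->{ $by_offset[$i + 1] } : $end_offset;
        seek $fh, $start, 0 or die "seek: $!";
        my $got = read $fh, my $record, $stop - $start;
        die "short read for object $by_offset[$i]\n"
            unless defined $got && $got == $stop - $start;
        $record_of{ $by_offset[$i] } = $record;
    }
    return \%record_of;
}

# Hypothetical miniature file, just to exercise the function.
my $body = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n(\xFF\n9 0 obj)\nendobj\n";
my ($out, $filename) = tempfile(UNLINK => 1);
binmode $out;
print {$out} $body;
close $out or die "close: $!";

open my $in, '<:raw', $filename or die "open: $!";
my $records = read_objects($in, { 1 => 9, 2 => 30 }, length $body);
close $in;
print length($records->{$_}), " bytes in object $_\n" for sort keys %$records;
```

Because the records are cut by offset rather than by pattern, the "9 0 obj" text inside the second object is never mistaken for a header. After editing, the new offsets follow from a running sum of the record lengths.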

        therefore my regex forces this byte sequence to be found at the beginning of a line

        A binary stream can also contain an end-of-line sequence, especially if we consider maliciously crafted input.
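A small made-up example shows how little it takes. The stream payload below is hypothetical and exists only to trigger the false match:

```perl
use strict;
use warnings;

# Hypothetical stream whose 12-byte payload contains a newline followed by
# text that looks like an object header.
my $pdf = join '',
    "1 0 obj\n<< /Length 12 >>\nstream\n",
    "\x01\x02\n7 0 obj\x03\n",
    "endstream\nendobj\n",
    "2 0 obj\n<< >>\nendobj\n";

my @found;
push @found, $1 while $pdf =~ /^(\d+) 0 obj/maag;

print "Headers matched: @found\n";
```

This prints "Headers matched: 1 7 2", although the file only contains objects 1 and 2. Anchoring at the line start does not help, because the payload supplies its own newline.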

Re: Calculated position incorrect when using regex in text file that also contains binary info
by Anonymous Monk on Jun 14, 2020 at 12:36 UTC
      Hi Anonymous Monk,

      Your answer (or was it a question?) was not clear to me at first. But then I realised you wanted to point to an existing Perl module.

      I was not aware of the existence of such a module, sorry. I looked into the methods (external ones, less external ones and deeper ones) and found some that might indeed help me do other things, like $self->deleteObject(objectnum).

      However, there's no method for recalculating the XREF table (at least, I couldn't find one that looks like it does this), so I might still do the recalculation myself.

      But it's anyhow nice/good to know such Perl module exists, so thanks for this!

      Best rgds,
      Geert