http://www.perlmonks.org?node_id=11118037

geertvc has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I have an uncompressed PDF file. I'm trying to find every position where the pattern <digit> 0 obj appears at the beginning of a line, e.g. 3 0 obj.
Once I've found an occurrence, I want to know how many bytes lie between the beginning of the file and that position.

The file also contains, next to regular ASCII characters, binary data. A snippet is shown here below:

/CapHeight 667 >> endobj 1 0 obj << /Subtype /Type1C /Length 1194 2 0 obj >> stream 3 0 obj   BXTBMO+URWTypewriterTOT-LigNar %øøø ûûúNú% Lm ÷/÷>¦ 7D(URW)++,Copyright 2003 by (URW)++ Design & Developmen +t/FSType 4 def - B F H J O S    r ¡S»OÑûb싸øÕ¸÷Å÷°À +ø÷KŸ—{zwû }†…}ûŠ}†‘™ø­™‘™Ô›”“š™‚“{ûp{ƒ|}•ƒ›¾™…}ü­}†…}X{ƒ|}•ƒ› +øMœ’’Å}µ÷g°÷4µµÄ÷¢Áø£z’„œÜš”“˜˜‚’|k}†‘š÷΃ªr¨°m\\œH +^f„|oaviZb{—~™™–•›Ȕ¸¬×¶©‚v¡¤t‘uJˆ„„‰ƒ‹Œ„YŒ<Z„{hYso^P4Ç +UîἬڬ.÷!±´‰‰ˆŽ†„e~L~u^r^sPDa±ÊÕį÷ º}µ÷V´÷Aµ±ÆøN÷rš“’› +¥ƒº¨àoH¹+û8.û.û-Þ/÷ÖÀ¢º®Ÿ¥š­œ›‚–}…„w†H{Ue?T^Ÿ±lr©zº±–’‘ +˜´‚†•«Ÿ¼¥©©¤´œº·±|p¤¨lž^d……òûIµ÷#µø ´oµ±Ã÷ÐÁ¬ø.…uƒovat_wJ>Y¨¾ˆ„’{{~~zužj¤vo¬º~ÑÛ»¸¯§­—¶Ïø  +™‘™²š”’˜™‚’|3{„„zI\\ÎsO°8û=1û\'û#Ý-÷¹ž¯®œ‘–’£û-÷Ø·®|l +©­g–fAG‚fsken^uZ+JÛ÷÷ ÊÖï‹´ø´öÜÞÜKÁè÷.øLœ„’z4|‚ƒ~”ƒš± +™…}ûê}†…}e|‚ƒ~”ƒš÷6š”“—˜‚“|k}†‘™ðløãuyyuužx¡¡¡¢ytñ‹ +´ø´oµðÁ÷–ÁØ÷/øLœ„’z3|‚ƒ~”ƒš²™…}ûê}†…}d|‚ƒ~”ƒš÷=š”“—˜ +‚“|e}†‘™÷i¸ïÀÉàÙµ_9û}†…}e|‚ƒ~”ƒš÷=š”“—˜‚“|d}†‘™÷œîSÁ +%>TfIu ‹´ø´oµðÁÐ÷/øLœ„’z3|‚ƒ~”ƒš²™…}ûê}†…}d|‚ƒ~”ƒš÷ +Mš”“—˜‚“|U}†‘™÷\\ۥº¢0”ž©“œŽ‹‰‘Š’Ž‹”‘’—›}•rec{rqyy‚{‚i +wŸødŸ÷KŸ¶ à ¶Ž Ò ÷Œø\\ endstream endobj 4 0 obj << /Length 422 >>

To find the locations in the above code snippet, I'm using a regex like so: qr/^\d+ 0 obj/m.
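As a minimal sketch of how that generic pattern could be applied to collect every position in one pass: the sample string below is a made-up stand-in for the real file contents, and the @hits array is a hypothetical name used only for illustration.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical stand-in for the real file contents.
my $pdf = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n12 0 obj\n<< >>\nendobj\n";

my @hits;
while ($pdf =~ /^(\d+) 0 obj/mg) {
    # $-[0] is the offset where the match starts,
    # $+[0] is the offset just past its end.
    push @hits, { obj => $1, start => $-[0], end => $+[0] };
}

say "object $_->{obj}: start $_->{start}, end $_->{end}" for @hits;
```

Note that ^ under /m only matches at the start of the string or right after a "\n", so a file that uses bare "\r" line endings would need a different anchor.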

This is the test code I'm using ($pdf contains the string, and \d is replaced with a fixed number as a test):

my $result = $pdf=~qr/^1 0 obj/m;
say "Finding first item at start position [$-[0]]" if $result;
say "Finding first item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^2 0 obj/m;
say "Finding second item at start position [$-[0]]" if $result;
say "Finding second item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^3 0 obj/m;
say "Finding third item at start position [$-[0]]" if $result;
say "Finding third item at start position [$+[0]]\n" if $result;

$result = $pdf=~qr/^4 0 obj/m;
say "Finding fourth item at start position [$-[0]]" if $result;
say "Finding fourth item at start position [$+[0]]\n" if $result;

This results in the following output:
Finding first item at start position [26]
Finding first item at start position [33]

Finding second item at start position [68]
Finding second item at start position [75]

Finding third item at start position [87]
Finding third item at start position [94]

Finding fourth item at start position [2035]
Finding fourth item at start position [2042]

The first 3 results are correct, the 4th is not. Reason: the first 3 matches sit in "pure ASCII" text, while the 4th match (4 0 obj) comes after the section that contains the binary content. The expected positions are 1315 and 1322 instead of 2035 and 2042.
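A mismatch like this is what one would see if positions are counted in one unit but expected in another. The following is a minimal sketch using a made-up string, not the real PDF: @- and @+ report offsets in elements of the string being matched, so the same text gives different numbers depending on whether it is held as decoded characters or as UTF-8 octets. Whether this is what happens with $pdf is an assumption, since how $pdf gets filled is not shown here.

```perl
use strict;
use warnings;
use feature 'say';
use Encode qw(encode);

# Made-up sample: three non-ASCII characters sit between two object headers.
my $chars = "1 0 obj\n\x{F8}\x{FB}\x{15C}\n4 0 obj\n";  # one element per character
my $bytes = encode('UTF-8', $chars);                    # same text as UTF-8 octets

my ($char_offset, $byte_offset);
$char_offset = $-[0] if $chars =~ /^4 0 obj/m;
$byte_offset = $-[0] if $bytes =~ /^4 0 obj/m;

say "offset in the character string: $char_offset";
say "offset in the octet string:     $byte_offset";
```

Each of the three non-ASCII characters takes two octets in UTF-8, so the two offsets differ by three. Everything before the first non-ASCII character gives identical offsets in both strings, which fits the observation that only the match after the binary section is off.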

I'm afraid there's a problem with my regex when the text file also contains binary data.

In Python, you can make the regex operate on bytes by putting a b in front of the pattern, like so: re.search(b'^\d+ 0 obj'... and that works perfectly. The locations found are spot on.

How can this be done in Perl?
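One candidate, offered as a sketch under assumptions rather than a confirmed fix: read the file through the :raw layer so that every element of the string is exactly one byte, which makes $-[0] a byte offset. The temp file and the @starts array below are hypothetical stand-ins; the real code would open the actual PDF path instead.

```perl
use strict;
use warnings;
use feature 'say';
use File::Temp qw(tempfile);

# Stand-in for the real PDF: a temp file holding a few raw high bytes.
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "1 0 obj\n\xF8\xFB\xC5\n4 0 obj\n";
close $tmp or die "close failed: $!";

# Read it back as raw bytes: no decoding layer, no newline translation.
open my $fh, '<:raw', $path or die "Cannot open $path: $!";
my $pdf = do { local $/; <$fh> };   # slurp the whole file
close $fh;

my @starts;
while ($pdf =~ /^(\d+) 0 obj/mg) {
    push @starts, $-[0];
    say "object $1 starts at byte $-[0]";
}
```

If the string instead comes from text pasted into the script or read through an :encoding layer, the offsets would count whatever elements that string holds, not the bytes of the file on disk.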

Best,
Geert