Re^2: Calculated position incorrect when using regex in text file that also contains binary info (updated)

Hello haukex,

Sorry for the late reply; I wasn't able to work on this for a few days.

I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour?

Anyway, I followed your advice and came up with the following test program:

use v5.010;
use strict;
use warnings;

# Reading in file in raw format.
local $/;
open F, "<:raw", "input.pdf" or die $!;
my $raw_content = <F>;

my $nr_of_cos_objects = 10;
my @counter = (1..$nr_of_cos_objects);
my $position = 0;

for my $number (@counter) {
    my $result = $raw_content=~qr/^${number} 0 obj/aa;
    if ($result) {
        say "Object item [$number] ('${number} 0 obj') starts at position  [$-[0]]";
    } else {
        say "Object item [$number] ('${number} 0 obj') start position not found";
    }

    if ($result) {
        say "Object item [$number] ('${number} 0 obj') ends   at position  [$+[0]]\n";
    } else {
        say "Object item [$number] ('${number} 0 obj') end   position not found\n";
    }
}

say "End test program.  Bye...";

Result: nothing is found at all when using /aa (or /a). This is the output:

Object item [1] ('1 0 obj') start position not found
Object item [1] ('1 0 obj') end   position not found

Object item [2] ('2 0 obj') start position not found
Object item [2] ('2 0 obj') end   position not found

Object item [3] ('3 0 obj') start position not found
Object item [3] ('3 0 obj') end   position not found

Object item [4] ('4 0 obj') start position not found
Object item [4] ('4 0 obj') end   position not found

Object item [5] ('5 0 obj') start position not found
Object item [5] ('5 0 obj') end   position not found

Object item [6] ('6 0 obj') start position not found
Object item [6] ('6 0 obj') end   position not found

Object item [7] ('7 0 obj') start position not found
Object item [7] ('7 0 obj') end   position not found

Object item [8] ('8 0 obj') start position not found
Object item [8] ('8 0 obj') end   position not found

Object item [9] ('9 0 obj') start position not found
Object item [9] ('9 0 obj') end   position not found

Object item [10] ('10 0 obj') start position not found
Object item [10] ('10 0 obj') end   position not found

The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations. When doing this, I get the results back, but incorrect (same results as my very initial attempts...).

Object item [1] ('1 0 obj') starts at position  [19]
Object item [1] ('1 0 obj') ends   at position  [26]

Object item [2] ('2 0 obj') starts at position  [235]
Object item [2] ('2 0 obj') ends   at position  [242]

Object item [3] ('3 0 obj') starts at position  [344]
Object item [3] ('3 0 obj') ends   at position  [351]

Object item [4] ('4 0 obj') starts at position  [667]
Object item [4] ('4 0 obj') ends   at position  [674]

Object item [5] ('5 0 obj') starts at position  [2663]
Object item [5] ('5 0 obj') ends   at position  [2670]

Object item [6] ('6 0 obj') starts at position  [3139]
Object item [6] ('6 0 obj') ends   at position  [3146]

Object item [7] ('7 0 obj') starts at position  [3514]
Object item [7] ('7 0 obj') ends   at position  [3521]

Object item [8] ('8 0 obj') starts at position  [3839]
Object item [8] ('8 0 obj') ends   at position  [3846]

Object item [9] ('9 0 obj') starts at position  [5063]
Object item [9] ('9 0 obj') ends   at position  [5070]

Object item [10] ('10 0 obj') starts at position  [5501]
Object item [10] ('10 0 obj') ends   at position  [5509]
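To illustrate why nothing matches without `/m`: by default `^` anchors only at the very start of the string, so on a slurped file it can never match an object header further down. The following is a minimal sketch, assuming a made-up in-memory string instead of the real `input.pdf` and a hypothetical helper `find_obj_header` (neither comes from the script above):

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): returns the start and
# end offsets of the "N 0 obj" header for object $number, or an empty list.
sub find_obj_header {
    my ($content, $number, $multiline) = @_;
    my $re = $multiline
        ? qr/^$number 0 obj/maa    # ^ matches at offset 0 and after every "\n"
        : qr/^$number 0 obj/aa;    # ^ matches only at offset 0
    return $content =~ $re ? ($-[0], $+[0]) : ();
}

# Made-up stand-in for the slurped file; the real input.pdf was not posted.
my $sample = "%PDF-1.4\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";

for my $number (1, 2) {
    for my $multiline (0, 1) {
        my @span = find_obj_header($sample, $number, $multiline);
        printf "object %d, /m %s: %s\n", $number, $multiline ? 'on ' : 'off',
            @span ? "start $span[0], end $span[1]" : 'not found';
    }
}
# object 1, /m off: not found
# object 1, /m on : start 9, end 16
# object 2, /m off: not found
# object 2, /m on : start 30, end 37
```

Because the sample string holds plain bytes, `$-[0]` and `$+[0]` are byte offsets here. Whether such offsets agree with the xref table of a real file is a separate question that this sketch does not answer.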

Best rgds,
Geert

Re^3: Calculated position incorrect when using regex in text file that also contains binary info (updated) by vr (Curate) on Jun 17, 2020 at 06:29 UTC
> When doing this, I get the results back, but incorrect (same results as my very initial attempts...)

Show the xref-table fragment for objects 1-10 or, better yet, provide a link to the test file.

Your approach to PDF hacking is seriously flawed; listen to what AM says and use a proper API (CAM::PDF). You don't need to manually touch, contract or edit the xref table after deleting an object, because that's done for you automatically; that's what the API is for. Just note that `deleteObject` is listed among the "deeper utilities" for a reason: one generally doesn't need to call it either, since unused objects will be cleaned out for you automatically, too.
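For readers unfamiliar with the xref table mentioned above, here is a minimal sketch of reading the byte offsets out of a classic xref section, so they can be compared with the positions a regex reports. It assumes the fixed-width entry layout (10-digit offset, 5-digit generation, `n` or `f`); the fragment and the helper name `parse_xref_section` are made up for illustration and are not from the poster's file or from CAM::PDF:

```perl
use strict;
use warnings;

# Hypothetical helper: maps object number => byte offset for in-use entries.
sub parse_xref_section {
    my ($xref) = @_;
    my (%offset_of, $next_obj);
    for my $line (split /\r\n|\r|\n/, $xref) {
        if ($line =~ /^(\d+) (\d+)\s*$/a) {
            # Subsection header: first object number and entry count.
            $next_obj = $1;
        }
        elsif (defined $next_obj and $line =~ /^(\d{10}) (\d{5}) ([nf])\s*$/a) {
            # Entry: offset, generation, n (in use) or f (free).
            $offset_of{$next_obj} = $1 + 0 if $3 eq 'n';
            $next_obj++;
        }
    }
    return %offset_of;
}

# Made-up fragment covering objects 0-2.
my $fragment = "xref\n0 3\n0000000000 65535 f \n0000000009 00000 n \n0000000030 00000 n \n";

my %offset_of = parse_xref_section($fragment);
printf "object %d is recorded at offset %d\n", $_, $offset_of{$_}
    for sort { $a <=> $b } keys %offset_of;
# object 1 is recorded at offset 9
# object 2 is recorded at offset 30
```

This only shows how to read the recorded offsets; it does not handle cross-reference streams or incremental updates, which is part of why a dedicated API is the safer route for real edits.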
Re^4: Calculated position incorrect when using regex in text file that also contains binary info (updated) by geertvc (Sexton) on Jun 17, 2020 at 17:37 UTC
Hi,

For sure I will take a look at the `CAM::PDF` module; I stated that in another reply somewhere in this thread. But I would like to know why you say my approach to PDF hacking is seriously flawed. Can you explain?

As I also replied elsewhere in this thread, I have Python code that works perfectly and does the job 100% correctly on the same file content, but it's extremely slow. That's why I would like to give it a try with Perl, since it outperforms many other languages at text manipulation (wasn't that one of the main reasons Perl was developed in the first place?).

So, I'm puzzled as to why my approach is flawed. Curious to hear/read your rationale behind this...

Best rgds,
Geert
Re^3: Calculated position incorrect when using regex in text file that also contains binary info by haukex (Archbishop) on Jun 21, 2020 at 10:18 UTC
Sorry, my reply is quite late as well.

> The `/m` is apparently indispensable in this regex setup, so I also tried `/ma` and `/maa` combinations.

Yes, that's what I meant, sorry. People will often say e.g. "the `/m` and `/s` modifiers" to distinguish modifiers visually from `m//` and `s///` or other things, but that doesn't mean those are the only modifiers that should be applied to the regex.

> I've followed your advice, opened the file in `raw` mode. Before, I was using the module `use Path::Tiny;` and then `path('<file.pdf>')->slurp_raw;` to open the file in raw mode. I guess that's the same behaviour? Anyway, I followed your advice and came up with the following test program

Yes, it's the same behavior. You didn't show that in your original node, and for this node you've shown what looks to be a complete script, but you're not showing your input (in a format compatible with text-only display, such as `hexdump -C input.pdf` or `od -tx1c input.pdf`) or your expected output, leaving us to guess what the issue is. This is why Short, Self-Contained, Correct Examples are so important: they let us reproduce the issue. Please show: short, representative sample input; a runnable piece of code; the expected output for the input; and the actual output, including any error messages.
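One way to check that a read really was raw is to compare the length of the slurped string with the file size on disk. A minimal sketch, assuming a throwaway temporary file with made-up content and a hypothetical helper `slurp_raw_bytes` (not from the thread); it uses a lexical filehandle and confines `local $/` to the helper:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read a whole file as raw bytes, without any layers
# that translate line endings or decode characters.
sub slurp_raw_bytes {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    local $/;                 # slurp mode, restored when the sub returns
    my $bytes = <$fh>;
    close $fh;
    return $bytes;
}

# Made-up sample: CRLF line endings plus a few high-bit bytes.
my ($out, $tmpname) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "%PDF-1.4\r\n%\xE2\xE3\xCF\xD3\r\n1 0 obj\r\n";
close $out or die "Cannot close $tmpname: $!";

my $bytes = slurp_raw_bytes($tmpname);
printf "read %d bytes, file size is %d bytes\n", length($bytes), -s $tmpname;
```

If the two numbers differ, some layer altered the data on the way in, and offsets computed on the string would no longer be byte offsets in the file.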
Re^4: Calculated position incorrect when using regex in text file that also contains binary info by geertvc (Sexton) on Jun 21, 2020 at 17:47 UTC
Hi haukex,

I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

I will prepare such an SSCCE (again, I learned a new abbreviation...) ASAP, I promise.

Could you tell me how to attach a file to a message, if that's allowed?

Thanks for taking the time to point out my weak points, I can only learn from this...

Best rgds,
Geert
Re^5: Calculated position incorrect when using regex in text file that also contains binary info by haukex (Archbishop) on Jun 21, 2020 at 19:01 UTC
> I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

Don't worry! Even the pros sometimes need to be reminded of SSCCE, sometimes because one becomes so deeply buried in a problem that one forgets that not everyone else is as deep into it :-)

> Could you tell me how to attach a file to a message, if that's allowed?

Everything goes in `<code>` tags, individual ones for input, code, output, etc., so it's easier to download. As long as it's ASCII, it'll work fine, hence my suggestion for showing binary data via `hexdump -C file` or `od -tx1c file` (Update: or on Windows, I like this little tool, see the "Releases" for a single-exe download). There are other potential formats too; for example, Data::Dump will automatically use pack or MIME::Base64 for binary data as necessary. However, if you use this module to show binary data, make sure you've read the data from the file in "raw" format, that is, with open mode `'<:raw'`, with binmode applied to the filehandle before reading, or with an equivalent method like `slurp_raw` from Path::Tiny. So for example, you can use the following to take the binary file `$filename` and output its contents in a Perl format, suitable for pasting into your SSCCE:

use Data::Dump qw/pp/;
print 'my $data = ',pp(do { open my $fh, '<:raw', $filename or die $!; local $/; <$fh> }),";\n";

Sometimes, if the problem might be related to the UTF-8 encoding of the Perl source code itself (`use utf8;`), the source can be posted inside `<pre>` instead of `<code>` tags, but only if all HTML special characters are escaped. One way is "`perl -MHTML::Entities -CSD -pe 'encode_entities $_' source.pl`". Personally I try to avoid this, and use escapes like `"\N{U+0000}"` or `"\N{CHARNAME}"`, so the source code can stay in ASCII.

And as hippo said, use PerlMonks' `<readmore>` tags if necessary, although you should try to keep things as short as possible, but still representative. If the only way to demonstrate a problem is with a fairly large file, then sometimes it's possible to provide a short script that generates such a file here, instead of the file itself (it shouldn't use rand or similar though, for reproducibility). And in the very worst case, large files can be uploaded to third-party sites, although that's the least preferable method because it's not permanent.

In your case, I think you should be able to edit down your input file to something short enough to post here that still reproduces the issue, following the above guidelines. And as usual in such cases, this may even help you pinpoint the problem better.
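If Data::Dump is not installed, the core module Data::Dumper can produce a similar ASCII-only literal. This is a swapped-in alternative rather than what the reply above recommends; the helper name `bytes_to_perl_literal` and the sample bytes are made up for illustration, and the sketch assumes the data is a byte string read in raw mode:

```perl
use strict;
use warnings;
use Data::Dumper ();

# Hypothetical helper: render a byte string as a double-quoted Perl literal
# in which non-printable and high-bit bytes appear as escapes.
sub bytes_to_perl_literal {
    my ($bytes) = @_;
    local $Data::Dumper::Useqq  = 1;   # double-quoted output with escapes
    local $Data::Dumper::Terse  = 1;   # no '$VAR1 = ' prefix
    local $Data::Dumper::Indent = 0;   # keep it on one line
    chomp(my $literal = Data::Dumper::Dumper($bytes));
    return $literal;
}

# Made-up sample standing in for a short binary input file.
my $data = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n1 0 obj\r\n";

# Prints a line that can be pasted into a posted script.
print 'my $data = ', bytes_to_perl_literal($data), ";\n";

# Evaluating the literal should give back the original bytes.
my $round_trip = eval bytes_to_perl_literal($data);
print "round trip ", ($round_trip eq $data ? "ok" : "FAILED"), "\n";
```

The round-trip check is there so that readers who paste the literal can trust it reproduces the same bytes.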
Re^6: Calculated position incorrect when using regex in text file that also contains binary info by geertvc (Sexton) on Jun 27, 2020 at 06:41 UTC
Re^7: Calculated position incorrect when using regex in text file that also contains binary info by haukex (Archbishop) on Jun 29, 2020 at 14:52 UTC
Re^6: Calculated position incorrect when using regex in text file that also contains binary info by Anonymous Monk on Jun 21, 2020 at 19:16 UTC
Re^5: Calculated position incorrect when using regex in text file that also contains binary info by hippo (Bishop) on Jun 21, 2020 at 18:08 UTC
There's no concept of attachments here. If you have code, enclose it in `<code>...</code>` tags. If you have lots of code, enclose it in `<readmore>...</readmore>` tags but only after remembering that the first S in SSCCE stands for "short". See also Writeup Formatting Tips.