PerlMonks  

Re^2: Calculated position incorrect when using regex in text file that also contains binary info (updated)

by geertvc (Sexton)
on Jun 17, 2020 at 04:36 UTC ( [id://11118163] )


in reply to Re: Calculated position incorrect when using regex in text file that also contains binary info (updated)
in thread Calculated position incorrect when using regex in text file that also contains binary info

Hello haukex,

Sorry for the late reply, but I wasn't able to work on this for a few days.

I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour?

Anyway, I followed your advice and came up with the following test program:

    use v5.010;
    use strict;
    use warnings;

    # Reading in file in raw format.
    local $/;
    open F, "<:raw", "input.pdf" or die $!;
    my $raw_content = <F>;

    my $nr_of_cos_objects = 10;
    my @counter = (1..$nr_of_cos_objects);
    my $position = 0;

    for my $number (@counter) {
        my $result = $raw_content=~qr/^${number} 0 obj/aa;
        if ($result) {
            say "Object item [$number] ('${number} 0 obj') starts at position [$-[0]]";
        } else {
            say "Object item [$number] ('${number} 0 obj') start position not found";
        }
        if ($result) {
            say "Object item [$number] ('${number} 0 obj') ends at position [$+[0]]\n";
        } else {
            say "Object item [$number] ('${number} 0 obj') end position not found\n";
        }
    }
    say "End test program. Bye...";

Result: nothing is found at all when using /aa (or /a). This is the output:

    Object item [1] ('1 0 obj') start position not found
    Object item [1] ('1 0 obj') end position not found

    Object item [2] ('2 0 obj') start position not found
    Object item [2] ('2 0 obj') end position not found

    Object item [3] ('3 0 obj') start position not found
    Object item [3] ('3 0 obj') end position not found

    Object item [4] ('4 0 obj') start position not found
    Object item [4] ('4 0 obj') end position not found

    Object item [5] ('5 0 obj') start position not found
    Object item [5] ('5 0 obj') end position not found

    Object item [6] ('6 0 obj') start position not found
    Object item [6] ('6 0 obj') end position not found

    Object item [7] ('7 0 obj') start position not found
    Object item [7] ('7 0 obj') end position not found

    Object item [8] ('8 0 obj') start position not found
    Object item [8] ('8 0 obj') end position not found

    Object item [9] ('9 0 obj') start position not found
    Object item [9] ('9 0 obj') end position not found

    Object item [10] ('10 0 obj') start position not found
    Object item [10] ('10 0 obj') end position not found

The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations. When doing this, I get the results back, but incorrect (same results as my very initial attempts...).

    Object item [1] ('1 0 obj') starts at position [19]
    Object item [1] ('1 0 obj') ends at position [26]

    Object item [2] ('2 0 obj') starts at position [235]
    Object item [2] ('2 0 obj') ends at position [242]

    Object item [3] ('3 0 obj') starts at position [344]
    Object item [3] ('3 0 obj') ends at position [351]

    Object item [4] ('4 0 obj') starts at position [667]
    Object item [4] ('4 0 obj') ends at position [674]

    Object item [5] ('5 0 obj') starts at position [2663]
    Object item [5] ('5 0 obj') ends at position [2670]

    Object item [6] ('6 0 obj') starts at position [3139]
    Object item [6] ('6 0 obj') ends at position [3146]

    Object item [7] ('7 0 obj') starts at position [3514]
    Object item [7] ('7 0 obj') ends at position [3521]

    Object item [8] ('8 0 obj') starts at position [3839]
    Object item [8] ('8 0 obj') ends at position [3846]

    Object item [9] ('9 0 obj') starts at position [5063]
    Object item [9] ('9 0 obj') ends at position [5070]

    Object item [10] ('10 0 obj') starts at position [5501]
    Object item [10] ('10 0 obj') ends at position [5509]
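
For readers following along: the difference between the two runs comes down to how ^ behaves. Without /m it only matches at the very start of the string; with /m it also matches after every newline. The following is a minimal sketch of that behaviour on made-up sample data (the $data string and the find_obj helper are assumptions for illustration, not taken from the poster's file). It does not explain why the reported offsets differ from what the poster expected.

```perl
use strict;
use warnings;

# Hypothetical sample: header line, a binary comment line, two dummy objects.
my $data = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n1 0 obj\n<< >>\nendobj\n2 0 obj\n<< >>\nendobj\n";

# Return [start, end] offsets of "N 0 obj" anchored with ^, or undef if absent.
# $multiline selects whether the /m modifier is applied.
sub find_obj {
    my ($content, $number, $multiline) = @_;
    my $re = $multiline ? qr/^$number 0 obj/maa : qr/^$number 0 obj/aa;
    return $content =~ $re ? [ $-[0], $+[0] ] : undef;
}

for my $n (1, 2) {
    my $plain = find_obj($data, $n, 0);
    my $multi = find_obj($data, $n, 1);
    printf "%d 0 obj: without /m %s, with /m %s\n", $n,
        $plain ? "[@$plain]" : "not found",
        $multi ? "[@$multi]" : "not found";
}
```

Because $data is a byte string here, @- and @+ report byte offsets: object 1 is found at [15 22] with /m and not at all without it.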

Best rgds,
Geert
Re^3: Calculated position incorrect when using regex in text file that also contains binary info (updated)
by vr (Curate) on Jun 17, 2020 at 06:29 UTC
    When doing this, I get the results back, but incorrect (same results as my very initial attempts...)

    Show the xref-table fragment for objects 1-10 or, better yet, provide a link to the test file.

    Your approach to PDF hacking is seriously flawed; listen to what AM says and use a proper API (CAM::PDF). You don't need to manually touch, contract or edit the xref table after deleting an object, because that is done for you automatically; that's what the API is for. Just note that deleteObject is listed among the "deeper utilities" for a reason: one generally doesn't need to call it either, since unused objects will be cleansed for you automatically, too.

      Hi,

      I will certainly take a look at the CAM::PDF module; I said so in another reply somewhere in this thread. But I would like to know why you say my approach to PDF hacking is seriously flawed. Can you explain?

      As I also replied elsewhere in this thread, I have Python code that works perfectly and does the job 100% correctly on the same file content, but it's extremely slow. That's why I would like to give it a try with Perl, since it outperforms many other languages at text manipulation (wasn't this one of the main reasons Perl was developed in the first place?).

      So, I'm puzzled as to why my approach is flawed. Curious to hear/read your rationale behind this...


      Best rgds,
      Geert
Re^3: Calculated position incorrect when using regex in text file that also contains binary info
by haukex (Archbishop) on Jun 21, 2020 at 10:18 UTC

    Sorry, my reply is quite late as well.

    The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations.

    Yes, that's what I meant, sorry - often people will say e.g. "the /m and /s modifiers" to distinguish modifiers visually from m// and s/// or other things, but that doesn't mean to say those are the only modifiers that should be applied to the regex.

    I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour? Anyway, I followed your advice and came up with the following test program

    Yes, it's the same behavior. You didn't show that in your original node, and for this node you've shown what looks to be a complete script, but you're not showing your input (in a format compatible with text-only display, such as hexdump -C input.pdf or od -tx1c input.pdf) or your expected output, leaving us to guess what the issue is. This is why Short, Self-Contained, Correct Examples are so important, so that we can reproduce the issue. Please show: Short, representative sample input, a runnable piece of code, the expected output for the input, and the actual output, including any error messages.
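
If neither hexdump nor od is at hand, a text-only view of binary input can also be produced in Perl itself. This is a rough sketch, not a replacement for those tools: the hexdump_lines helper and the $sample bytes are made up for illustration, and the layout only approximates hexdump -C.

```perl
use strict;
use warnings;

# Format a byte string roughly like `hexdump -C`:
# 8-digit hex offset, up to 16 hex bytes, then the printable ASCII view.
sub hexdump_lines {
    my ($bytes) = @_;
    my @lines;
    for (my $off = 0; $off < length $bytes; $off += 16) {
        my $chunk = substr $bytes, $off, 16;
        my $hex   = join ' ', map { sprintf('%02x', ord($_)) } split //, $chunk;
        (my $text = $chunk) =~ tr/\x20-\x7e/./c;   # non-printable bytes become "."
        push @lines, sprintf '%08x  %-47s  |%s|', $off, $hex, $text;
    }
    return @lines;
}

# Hypothetical sample bytes; for a real file, read it with open mode '<:raw' first.
my $sample = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n1 0 obj\n";
print "$_\n" for hexdump_lines($sample);
```

The output is plain ASCII, so it can be pasted into a post without the binary bytes being mangled.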

      Hi haukex,

      I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

      I will prepare such an SSCCE (again, I learned a new abbreviation...) ASAP, I promise. Could you tell me how to attach a file to a message, if that's allowed?

      Thanks for taking the time to point out my weak points; I can only learn from this...

      Best rgds,
      Geert
        I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

        Don't worry! Even the pros sometimes need to be reminded of SSCCE, sometimes because one becomes so deeply buried in a problem that one forgets that not everyone else is so into it as well :-)

        Could you tell me how to attach a file to a message, if that's allowed?

        Everything goes in <code> tags, individual ones for input, code, output, etc., so it's easier to download. As long as it's ASCII, it'll work fine, hence my suggestion for showing binary data via hexdump -C file or od -tx1c file (Update: or on Windows, I like this little tool, see the "Releases" for a single-exe download). There are other potential formats too, like for example Data::Dump will automatically use pack or MIME::Base64 for binary data as necessary - however, if you use this module to show binary data, make sure you've read the data from the file in "raw" format, that is, with open mode '<:raw', binmoded the filehandle before reading, or an equivalent method like slurp_raw from Path::Tiny.
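
As an illustration of the "raw" reading options mentioned above, here is a small sketch using only core modules. The helper names slurp_open_raw and slurp_binmode and the temporary test file are assumptions made for this example; Path::Tiny's slurp_raw is left out only because it is not a core module.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Slurp a file as raw bytes by putting ':raw' in the open mode.
sub slurp_open_raw {
    my ($filename) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    local $/;                       # undef record separator: read everything
    return scalar <$fh>;
}

# Slurp a file as raw bytes by calling binmode before the first read.
sub slurp_binmode {
    my ($filename) = @_;
    open my $fh, '<', $filename or die "Cannot open $filename: $!";
    binmode $fh;
    local $/;
    return scalar <$fh>;
}

# Demo on a temporary file that contains CRLF line endings and high bytes.
my ($out, $tmpname) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "line1\r\n\xE2\xE3\xCF\xD3\r\n";
close $out or die "Cannot close $tmpname: $!";

my $via_open    = slurp_open_raw($tmpname);
my $via_binmode = slurp_binmode($tmpname);
printf "open raw: %d bytes, binmode: %d bytes, identical: %s\n",
    length $via_open, length $via_binmode,
    $via_open eq $via_binmode ? 'yes' : 'no';
```

Both helpers should return the same 13 bytes, with the \r\n pairs left untouched.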

        So for example, you can use the following to take the binary file $filename and output its contents in a Perl format, suitable for pasting into your SSCCE:

            use Data::Dump qw/pp/;
            print 'my $data = ', pp(do {
                open my $fh, '<:raw', $filename or die $!;
                local $/;
                <$fh>
            }), ";\n";
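
Where Data::Dump is not installed, something similar can probably be done with the core module Data::Dumper. This is a sketch under that assumption, not part of the advice above; the perl_literal helper and the sample bytes are hypothetical, and the exact escape style of the output may differ from what pp produces.

```perl
use strict;
use warnings;
use Data::Dumper;

# Render a byte string as a double-quoted Perl literal using core Data::Dumper.
sub perl_literal {
    my ($bytes) = @_;
    local $Data::Dumper::Useqq  = 1;   # use double quotes and escape non-printable bytes
    local $Data::Dumper::Terse  = 1;   # emit only the value, without "$VAR1 = "
    local $Data::Dumper::Indent = 0;   # keep the output on a single line
    return Dumper($bytes);
}

# Hypothetical sample; in practice this would be the raw file contents.
my $data = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n1 0 obj\n";
print 'my $data = ', perl_literal($data), ";\n";
```

The printed line is ASCII-only Perl source, so evaluating it should give back the original bytes.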

        Sometimes, if the problem might be related to the UTF-8 encoding of the Perl source code itself (use utf8;), the source can be posted inside <pre> instead of <code> tags, but only if all HTML special characters are escaped - one way is "perl -MHTML::Entities -CSD -pe 'encode_entities $_' source.pl". Personally I try to avoid this, and use escapes like "\N{U+0000}" or "\N{CHARNAME}", so the source code can stay in ASCII. And as hippo said, use PerlMonks' <readmore> tags if necessary, although you should try to keep things as short as possible - but still representative.

        If the only way to demonstrate a problem is with a fairly large file, then sometimes it's possible to provide a short script that generates such a file here, instead of the file itself (it shouldn't use rand or similar though, for reproducibility). And in the very worst case, large files can be uploaded to third-party sites, although that's the least preferable method because it's not permanent.
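
A generator script of the kind described above might look like the following sketch. Everything in it is an assumption made for illustration: the layout only resembles a PDF (it is not a valid one), and the names build_sample and write_sample are made up. The filler bytes are derived from the object number rather than from rand, so every run produces identical output.

```perl
use strict;
use warnings;

# Build a small, deterministic, PDF-like byte string (hypothetical layout,
# not a valid PDF): header, binary marker line, then $count dummy objects.
sub build_sample {
    my ($count) = @_;
    my $bytes = "%PDF-1.4\n%\xE2\xE3\xCF\xD3\n";
    for my $n (1 .. $count) {
        # 16 filler bytes in the range 0x80-0xFF, derived from the object number.
        my $filler = pack 'C*', map { 0x80 | (($n * 37 + $_) % 128) } 0 .. 15;
        $bytes .= "$n 0 obj\n<< /Length 16 >>\nstream\n$filler\nendstream\nendobj\n";
    }
    return $bytes;
}

# Write the bytes unchanged; ':raw' prevents newline or encoding translation.
sub write_sample {
    my ($filename, $bytes) = @_;
    open my $fh, '>:raw', $filename or die "Cannot write $filename: $!";
    print {$fh} $bytes;
    close $fh or die "Cannot close $filename: $!";
}

# Only write a file when a name is given on the command line.
write_sample($ARGV[0], build_sample(3)) if @ARGV;
```

Run as, for example, perl generate.pl sample.bin (the script name is arbitrary); posting the script instead of the file lets others recreate the exact same input.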

        In your case, I think you should be able to edit down your input file to something short enough to post here that still reproduces the issue, following the above guidelines. And as usual in such cases, this may even help you pinpoint the problem better.

        There's no concept of attachments here. If you have code, enclose it in <code>...</code> tags. If you have lots of code, enclose it in <readmore>...</readmore> tags but only after remembering that the first S in SSCCE stands for "short".

        See also Writeup Formatting Tips.
