PerlMonks  

Re^4: Calculated position incorrect when using regex in text file that also contains binary info

by geertvc (Sexton)
on Jun 21, 2020 at 17:47 UTC (#11118318)


in reply to Re^3: Calculated position incorrect when using regex in text file that also contains binary info
in thread Calculated position incorrect when using regex in text file that also contains binary info

Hi haukex,

I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

I will prepare such an SSCCE (I learned a new abbrev again...) ASAP, I promise. Could you tell me how to attach a file to a message, if that's allowed?

Thanks for taking the time to point me to my weak points, I can only learn from this...

Best rgds,
Geert

Re^5: Calculated position incorrect when using regex in text file that also contains binary info
by haukex (Bishop) on Jun 21, 2020 at 19:01 UTC
    I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

    Don't worry! Even the pros sometimes need to be reminded of SSCCE, sometimes because one becomes so deeply buried in a problem that one forgets that not everyone else is so into it as well :-)

    Could you tell me how to attach a file to a message, if that's allowed?

    Everything goes in <code> tags, individual ones for input, code, output, etc., so it's easier to download. As long as it's ASCII, it'll work fine, hence my suggestion for showing binary data via hexdump -C file or od -tx1c file. There are other potential formats too, like for example Data::Dump will automatically use pack or MIME::Base64 for binary data as necessary - however, if you use this module to show binary data, make sure you've read the data from the file in "raw" format, that is, with open mode '<:raw', binmoded the filehandle before reading, or an equivalent method like slurp_raw from Path::Tiny.

    So for example, you can use the following to take the binary file $filename and output its contents in a Perl format, suitable for pasting into your SSCCE:

    use Data::Dump qw/pp/;
    print 'my $data = ', pp(do {
        open my $fh, '<:raw', $filename or die $!;
        local $/;
        <$fh>
    }), ";\n";
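To illustrate why the "raw" advice matters for anything involving positions, here is a minimal sketch. The helper name slurp_with_layer and the two-byte test content are made up for this illustration. It reads the same file once as bytes and once through a decoding layer, and the lengths (and therefore any offsets computed from the string) differ.

```perl
use strict;
use warnings;
use File::Temp qw/tempfile/;

# Write two raw bytes (the UTF-8 encoding of U+00E9) to a temporary file.
my ($out, $filename) = tempfile( UNLINK => 1 );
binmode $out;
print {$out} "\xC3\xA9";
close $out or die "Can't close $filename: $!";

# Hypothetical helper: slurp a whole file through the given PerlIO layer.
sub slurp_with_layer {
    my ($file, $layer) = @_;
    open my $fh, "<$layer", $file or die "Can't open $file: $!";
    local $/;               # undef input record separator: read everything
    return scalar <$fh>;
}

my $bytes = slurp_with_layer($filename, ':raw');              # 2 bytes
my $chars = slurp_with_layer($filename, ':encoding(UTF-8)');  # 1 character

printf "raw: %d, decoded: %d\n", length $bytes, length $chars;
```

With ':raw', length and regex match positions count bytes, which is what file offsets are measured in.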

    Sometimes, if the problem might be related to the UTF-8 encoding of the Perl source code itself (use utf8;), the source can be posted inside <pre> instead of <code> tags, but only if all HTML special characters are escaped - one way is "perl -MHTML::Entities -CSD -pe 'encode_entities $_' source.pl". Personally I try to avoid this, and use escapes like "\N{U+0000}" or "\N{CHARNAME}", so the source code can stay in ASCII. And as hippo said, use PerlMonks' <readmore> tags if necessary, although you should try to keep things as short as possible - but still representative.

    If the only way to demonstrate a problem is with a fairly large file, then sometimes it's possible to provide a short script that generates such a file here, instead of the file itself (it shouldn't use rand or similar though, for reproducibility). And in the very worst case, large files can be uploaded to third-party sites, although that's the least preferable method because it's not permanent.
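A generator script of that kind could look roughly like the sketch below. The file name sample.bin, the function names, and the content (two object markers around a block of binary bytes) are all assumptions chosen for illustration. The point is only that the output is identical on every run because nothing random is involved.

```perl
use strict;
use warnings;

# Build a small, fully deterministic sample: text markers around a block
# of binary bytes. No rand, so every run produces the same content.
sub make_sample {
    # 300 bytes: 0x00..0xFF once, then 0x00..0x2B again
    my $binary = join '', map { chr( $_ % 256 ) } 0 .. 299;
    return "1 0 obj\n" . $binary . "\nendobj\n2 0 obj\nendobj\n";
}

# Write the sample to disk without any newline or encoding translation.
sub write_sample {
    my ($filename) = @_;
    open my $fh, '>:raw', $filename or die "Can't write $filename: $!";
    print {$fh} make_sample();
    close $fh or die "Can't close $filename: $!";
    return;
}

# Only write the file when a name is given: perl make_sample.pl sample.bin
write_sample( $ARGV[0] ) if @ARGV;
```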

    In your case, I think you should be able to edit down your input file to something short enough to post here that still reproduces the issue, following the above guidelines. And as usual in such cases, this may even help you pinpoint the problem better.

      Hi haukex,

      I have to make a really big confession... After all, we're in a monastery. With monks. So it should be possible, allowed and doable, right? ;-)

      I made a horrible mistake by checking the calculated offset in my text editor (NPP) instead of comparing it with the XREF table which is part of the PDF itself. If I do the latter, then all is working perfectly. I feel a bit dumb now...

      Since I'm so pissed off (at myself, I mean) I stubbornly refused to give up and moved forward with making my SSCCE (as promised), even if it's only to "prove" to myself that I'm able to do this... :-)

      And here's the result (in case it might be useful for someone else in the future that doesn't make such a silly mistake as I did...)

      I've tried to assemble an example that should (hopefully) be SSCCE-compliant. Fingers crossed I got it right this time.

      First, the original, uncompressed PDF file. It's a simple and very small PDF document called example_uncompressed.pdf. It's been made with LibreOffice and saved as PDF. It contains only one line of text: "A small PDF.".

      Here goes the content of the uncompressed file. I've put it between the "readmore" and "code" tags as advised by you and others, I do hope it works (running the risk of being keelhauled if it isn't...):



      I just added it so that you could see the correct start locations of the different objects. The list of object starting positions can be found at the end of the above document (copy/pasting it into an editor should reveal this...) and is also given below for clarity. There are apparently 11 COS objects (the first line must be ignored) and, e.g., the 3rd COS object has an offset of 11269 bytes from the beginning of the file (that one already comes after a section that has binary content in it).

      0000000000 65535 f
      0000000015 00000 n
      0000000422 00000 n
      0000011269 00000 n
      0000000218 00000 n
      0000000522 00000 n
      0000010335 00000 n
      0000010537 00000 n
      0000011029 00000 n
      0000011236 00000 n
      0000011326 00000 n
      0000011477 00000 n
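For comparing against this table programmatically, a minimal sketch follows. The function name parse_xref_entries is hypothetical, and the sketch assumes a single table section whose first entry belongs to object 0, as in the listing above, so that the array index equals the object number.

```perl
use strict;
use warnings;

# Parse entries of the form "oooooooooo ggggg n" or "oooooooooo ggggg f"
# (10-digit byte offset, 5-digit generation, in-use or free flag).
# Assumption: the entries start at object 0, so index == object number.
sub parse_xref_entries {
    my ($text) = @_;
    my @entries;
    while ( $text =~ /(\d{10}) (\d{5}) ([nf])/g ) {
        push @entries, {
            offset     => $1 + 0,       # numify to drop the leading zeros
            generation => $2 + 0,
            in_use     => $3 eq 'n',
        };
    }
    return \@entries;
}

my $listing = join "\n",
    '0000000000 65535 f', '0000000015 00000 n', '0000000422 00000 n',
    '0000011269 00000 n', '0000000218 00000 n', '0000000522 00000 n',
    '0000010335 00000 n', '0000010537 00000 n', '0000011029 00000 n',
    '0000011236 00000 n', '0000011326 00000 n', '0000011477 00000 n';

my $entries = parse_xref_entries($listing);
printf "object 3 starts at byte %d\n", $entries->[3]{offset};
```

Entry 0 is the free entry that has to be ignored, which is why the listing has 12 lines for 11 objects.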


      Now for the dumped version of the above, I've used your script and sent the output to a file. This is the result (same here, used "readmore" and "code" to avoid the direct visibility of it):



      And finally, the code I've used to search for the different locations where x 0 obj occurs. x is a number going from 1 to - in this example - 11.

      To use it, first save the above content (full or dumped) into a file of your choice, then call the script using perl sscce.pl <filename>.
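A minimal sketch of such a search is given below. It is an illustration, not the code from this post. The function name find_object_offsets is hypothetical, and the regex is a simplification that could also match inside binary stream data. The file is read with ':raw' so that match positions are byte offsets, comparable to the XREF table.

```perl
use strict;
use warnings;

# Return a hash ref mapping object number => byte offset of "N 0 obj".
# Assumption: $data holds raw bytes, so string positions are byte offsets.
sub find_object_offsets {
    my ($data) = @_;
    my %offset_of;
    while ( $data =~ /\b(\d+) 0 obj\b/g ) {
        $offset_of{$1} = $-[0];    # $-[0] is the start offset of the match
    }
    return \%offset_of;
}

# Usage: perl sscce.pl <filename>
if (@ARGV) {
    my $filename = shift @ARGV;
    open my $fh, '<:raw', $filename or die "Can't open $filename: $!";
    my $data = do { local $/; <$fh> };
    close $fh;

    my $offsets = find_object_offsets($data);
    printf "%d 0 obj at offset %d\n", $_, $offsets->{$_}
        for sort { $a <=> $b } keys %$offsets;
}
```

The offsets printed this way should be checked against the XREF table in the file itself rather than against positions shown by a text editor.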

      Since all is going well now, I was able to scan my whole PDF file and compared to the speed I had with Python to do the same, it's blazing fast in Perl!!! What took "forever" in Python - say, half an hour or so? dunno really, since I never got the patience to wait until it was finished - takes not even half a minute in Perl...

      Indeed, once again it's proven that Perl is much, much better, more powerful and faster for text processing than almost any other language (apart from pure C, I guess)!

      Thanks anyway for your patience and willingness to help me. Much, much appreciated! Others too!

      Best rgds,
      Geert

        I'm glad you got it figured out, and thanks for taking the time to make the SSCCE! We're here for Rubber duck debugging too ;-)

      U+1F44D 👍 THUMBS UP SIGN

      $ perl -e"use Path::Tiny; use Data::Dump; dd( path( shift )->slurp_raw )" -- City_Slicker_Hayboxes.pdf
      do {
        require MIME::Base64;
        MIME::Base64::decode("JVBERi0xLjMKJe+/ve+/ve+/ve+/vQoKMSAwIG9iago8PCAvVHlwZSAvQ2F0YWxvZwovT3V0bGluZXMgMiAwIFIKL1BhZ2Vz ******** i SNIPPED this ****** dHhyZWYKMjMyMQolJUVPRgo=");
      }
Re^5: Calculated position incorrect when using regex in text file that also contains binary info
by hippo (Chancellor) on Jun 21, 2020 at 18:08 UTC

    There's no concept of attachments here. If you have code, enclose it in <code>...</code> tags. If you have lots of code, enclose it in <readmore>...</readmore> tags but only after remembering that the first S in SSCCE stands for "short".

    See also Writeup Formatting Tips.
