PerlMonks  

Re: Calculated position incorrect when using regex in text file that also contains binary info (updated)

by haukex (Bishop)
on Jun 14, 2020 at 09:11 UTC (#11118039)


in reply to Calculated position incorrect when using regex in text file that also contains binary info

Just two quick thoughts for now: open the file with the mode '<:raw' (or binmode the handle after opening but before reading), and add the /aa regex modifier (perlre) to your regexen.

Update: I didn't have enough time earlier to explain why I made these two suggestions, so let me do that now. First, note that on Windows, the :crlf PerlIO layer is active by default, translating CRLFs to LFs, which isn't good for binary data. Also, opens without explicit layers can be affected by the open pragma. Second, if you were to open the file with an encoding layer, under Unicode matching rules (see /u in perlre), \d matches any Unicode digits (see "Digits" in perlrecharclass).
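To make the two points above concrete, here is a minimal sketch. The helper name slurp_raw is only an illustration for this example (it is not Path::Tiny's method), and the test character U+0663 (ARABIC-INDIC DIGIT THREE) is just one convenient non-ASCII digit. It reads a file as plain bytes and then shows that \d accepts that character under Unicode rules but not under /a.

```perl
use strict;
use warnings;

# Hypothetical helper: read a whole file as bytes, with no CRLF
# translation and no decoding, regardless of platform defaults.
sub slurp_raw {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $data = <$fh>;
    close $fh;
    return $data;
}

# A non-ASCII digit: the string is character data, so Unicode rules apply.
my $arabic_indic_three = "\x{0663}";

# Without /a, \d matches any Unicode digit; with /a, only 0-9.
my $unicode_match = $arabic_indic_three =~ /^\d$/  ? 1 : 0;
my $ascii_match   = $arabic_indic_three =~ /^\d$/a ? 1 : 0;

print "Unicode rules: $unicode_match, /a rules: $ascii_match\n";
```

For \d alone, /a is already enough; /aa additionally keeps ASCII and non-ASCII characters from matching each other under case-insensitive matching.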

Update 2: Typo fix


Replies are listed 'Best First'.
Re^2: Calculated position incorrect when using regex in text file that also contains binary info (updated)
by geertvc (Sexton) on Jun 17, 2020 at 04:36 UTC
    Hello haukex,

    Sorry for the late reply, but I wasn't able to work on this for a few days.

    I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour?

    Anyway, I followed your advice and came up with the following test program:

    use v5.010;
    use strict;
    use warnings;

    # Reading in file in raw format.
    local $/;
    open F, "<:raw", "input.pdf" or die $!;
    my $raw_content = <F>;

    my $nr_of_cos_objects = 10;
    my @counter = (1..$nr_of_cos_objects);
    my $position = 0;

    for my $number (@counter) {
        my $result = $raw_content=~qr/^${number} 0 obj/aa;
        if ($result) {
            say "Object item [$number] ('${number} 0 obj') starts at position [$-[0]]";
        }
        else {
            say "Object item [$number] ('${number} 0 obj') start position not found";
        }
        if ($result) {
            say "Object item [$number] ('${number} 0 obj') ends at position [$+[0]]\n";
        }
        else {
            say "Object item [$number] ('${number} 0 obj') end position not found\n";
        }
    }
    say "End test program. Bye...";

    Result: nothing is found at all when using /aa (or /a). This is the output:

    Object item [1] ('1 0 obj') start position not found
    Object item [1] ('1 0 obj') end position not found
    Object item [2] ('2 0 obj') start position not found
    Object item [2] ('2 0 obj') end position not found
    Object item [3] ('3 0 obj') start position not found
    Object item [3] ('3 0 obj') end position not found
    Object item [4] ('4 0 obj') start position not found
    Object item [4] ('4 0 obj') end position not found
    Object item [5] ('5 0 obj') start position not found
    Object item [5] ('5 0 obj') end position not found
    Object item [6] ('6 0 obj') start position not found
    Object item [6] ('6 0 obj') end position not found
    Object item [7] ('7 0 obj') start position not found
    Object item [7] ('7 0 obj') end position not found
    Object item [8] ('8 0 obj') start position not found
    Object item [8] ('8 0 obj') end position not found
    Object item [9] ('9 0 obj') start position not found
    Object item [9] ('9 0 obj') end position not found
    Object item [10] ('10 0 obj') start position not found
    Object item [10] ('10 0 obj') end position not found

    The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations. When doing this, I get the results back, but incorrect (same results as my very initial attempts...).

    Object item [1] ('1 0 obj') starts at position [19]
    Object item [1] ('1 0 obj') ends at position [26]
    Object item [2] ('2 0 obj') starts at position [235]
    Object item [2] ('2 0 obj') ends at position [242]
    Object item [3] ('3 0 obj') starts at position [344]
    Object item [3] ('3 0 obj') ends at position [351]
    Object item [4] ('4 0 obj') starts at position [667]
    Object item [4] ('4 0 obj') ends at position [674]
    Object item [5] ('5 0 obj') starts at position [2663]
    Object item [5] ('5 0 obj') ends at position [2670]
    Object item [6] ('6 0 obj') starts at position [3139]
    Object item [6] ('6 0 obj') ends at position [3146]
    Object item [7] ('7 0 obj') starts at position [3514]
    Object item [7] ('7 0 obj') ends at position [3521]
    Object item [8] ('8 0 obj') starts at position [3839]
    Object item [8] ('8 0 obj') ends at position [3846]
    Object item [9] ('9 0 obj') starts at position [5063]
    Object item [9] ('9 0 obj') ends at position [5070]
    Object item [10] ('10 0 obj') starts at position [5501]
    Object item [10] ('10 0 obj') ends at position [5509]

    Best rgds,
    Geert
      When doing this, I get the results back, but incorrect (same results as my very initial attempts...)

      Show the xref-table fragment for objects 1-10 or, better yet, provide a link to the test file.

      Your approach to PDF hacking is seriously flawed; listen to what AM says and use a proper API (CAM::PDF). You don't need to manually touch, contract, or edit the xref table after deleting an object, because that's done for you automatically; that's what the API is for. Just note that deleteObject is among the "deeper utilities" for a reason: one generally doesn't need to call it either, since unused objects will be cleaned out for you automatically, too.

        Hi,

        I will certainly take a look at the CAM::PDF module; I stated that in another reply somewhere in this thread. But I would like to know why you say my approach to PDF hacking is seriously flawed. Can you explain?

        As I also replied elsewhere in this thread, I have Python code that works perfectly and does the job 100% correctly on the same file content, but it's extremely slow. That's why I would like to give it a try with Perl, since it outperforms many other languages at text manipulation (wasn't this one of the main reasons Perl was developed in the first place?).

        So, I'm puzzled as to why my approach is flawed. Curious to hear/read your rationale behind this...


        Best rgds,
        Geert

      Sorry, my reply is quite late as well.

      The /m is apparently indispensable in this regex setup, so I also tried /ma and /maa combinations.

      Yes, that's what I meant, sorry - often people will say e.g. "the /m and /s modifiers" to distinguish modifiers visually from m// and s/// or other things, but that doesn't mean to say those are the only modifiers that should be applied to the regex.
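As a small illustration of combining the modifiers, here is a sketch on made-up sample data (the string below is not the poster's PDF, and find_obj is a hypothetical helper name). With /m, ^ matches after every "\n" in the string rather than only at its very start, and because the string holds raw bytes, @- and @+ report byte offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: return the (start, end) byte offsets of the first
# "N 0 obj" found at the start of a line, or an empty list if none.
sub find_obj {
    my ($content, $number) = @_;
    return unless $content =~ /^\Q$number\E 0 obj/maa;
    return ($-[0], $+[0]);
}

# Made-up sample: a header, a comment line with high bytes, two objects.
my $sample = "%PDF-1.4\n"
           . "%\xE2\xE3\xCF\xD3\n"
           . "1 0 obj\n<< >>\nendobj\n"
           . "2 0 obj\n<< >>\nendobj\n";

for my $n (1, 2, 3) {
    my ($start, $end) = find_obj($sample, $n);
    if (defined $start) {
        print "Object $n: starts at $start, ends at $end\n";
    }
    else {
        print "Object $n: not found\n";
    }
}
```

One caveat worth keeping in mind: with /m, ^ only matches after "\n", so an object header that follows a bare "\r" line ending would not be found by this pattern.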

      I've followed your advice, opened the file in raw mode. Before, I was using the module use Path::Tiny; and then path('<file.pdf>')->slurp_raw; to open the file in raw mode. I guess that's the same behaviour? Anyway, I followed your advice and came up with the following test program

      Yes, it's the same behavior. You didn't show that in your original node, and for this node you've shown what looks to be a complete script, but you're not showing your input (in a format compatible with text-only display, such as hexdump -C input.pdf or od -tx1c input.pdf) or your expected output, leaving us to guess what the issue is. This is why Short, Self-Contained, Correct Examples are so important, so that we can reproduce the issue. Please show: Short, representative sample input, a runnable piece of code, the expected output for the input, and the actual output, including any error messages.
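If hexdump or od isn't available (for example on Windows), a text-only dump of the first bytes can also be produced from Perl itself. The following is only a rough sketch: hexdump_lines is a made-up helper, and its layout merely approximates that of hexdump -C.

```perl
use strict;
use warnings;

# Hypothetical helper: format the first $limit bytes of $bytes as
# "offset  hex bytes  |printable text|" lines, 16 bytes per line.
sub hexdump_lines {
    my ($bytes, $limit) = @_;
    $limit //= 64;
    my $chunk = substr $bytes, 0, $limit;
    my @lines;
    for (my $off = 0; $off < length($chunk); $off += 16) {
        my $row = substr $chunk, $off, 16;
        my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $row;
        # Replace everything outside printable ASCII with a dot.
        (my $text = $row) =~ tr/\x20-\x7e/./c;
        push @lines, sprintf '%08x  %-47s  |%s|', $off, $hex, $text;
    }
    return @lines;
}

# Usage on a small made-up byte string; for a real file, read it with
# the '<:raw' mode first and pass the contents in.
print "$_\n" for hexdump_lines("%PDF-1.4\r\n%\xE2\xE3\xCF\xD3\r\n1 0 obj\r\n");
```

Posting a few such lines around the offsets in question makes it possible for others to see exactly which line endings and bytes the regex is running against.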

        Hi haukex,

        I really have to learn to be more precise (and maybe concise too???) in my answers. Really sorry for that, I'm feeling a bit uncomfortable and embarrassed now...

        I will prepare such an SSCCE (again I learned a new abbreviation...) ASAP, I promise. Could you tell me how to attach a file to a message, if that's allowed?

        Thanks for taking the time to point me to my weak points, I can only learn from this...

        Best rgds,
        Geert
