regex for identifying encrypted text

by skendric (Novice)
on May 16, 2018 at 10:06 UTC (#1214624=perlquestion)
skendric has asked for the wisdom of the Perl Monks concerning the following question:

I write scripts which compare two text files and then do interesting things if they are different.
    use Text::Diff qw(diff);
    [...]
    $diff = diff "$config_dir/$config_old", "$config_dir/$config_new", { STYLE => "OldStyle" };
    @diff = split '\n', $diff;
    [...]
Typically, I want to ignore certain changes ... in the example below, I am uninterested in lines which contain the string 'set password ENC'. I end up writing code like:
    LINE: for my $line (@diff) {
        next LINE if $line =~ /set password ENC/;
        [...]
    }
Now, I'm discovering that I am uninterested in changes to private keys ... a typical line in a file might look like this:
set private-key "-----BEGIN ENCRYPTED PRIVATE KEY----- MIIFDjBABgkqhkiG9w0BBQ0wMzAbBgkqhkiG9w0BBQwwDgQInXCep+2zzpgCAggA MBQGCCqGSIb3DHMHBAiSZZZ3CUL1cQSCBNhxHiU0wI3XOMU05aVZybU6OOJOJBa/ M+b28ad6P8VZiN+eToUfs3pTg+VqzAc273fdnZPZFMClXpJk8kQZv0ruEoA99RqE pgsnYGVxzZNmDy5HT3yBDGjRCssDnQ8QUBqabFCpW6d7fzilw9PnoHjFRmLxKnNE [...]
I'm struggling to figure out how to ignore such lines. My brain wants to construct a regex which identifies "random strings", so that I could write a line like:
next LINE if $line =~ /{looks like random stuff to me}/;
(1) Suggestions on how to construct such a regex?
(2) Suggestions on how to tackle the problem differently?


Replies are listed 'Best First'.
Re: regex for identifying encrypted text
by Eily (Prior) on May 16, 2018 at 10:22 UTC

    That looks like Base64 encoding. There are clues that can help you identify those lines (64 characters wide, except maybe the last; only the characters allowed by Base64, so no spaces; etc.), but since you have "BEGIN ENCRYPTED PRIVATE KEY", I'm guessing you might also have an END. If that's the case, the better solution might be to ignore all the lines between those two tokens. Using the flip-flop (scalar-context) version of the .. operator, this could be something like:
        next LINE if ($line=~/BEGIN ENCRYPTED/)..($line=~/END ENCRYPTED/);
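
A minimal sketch of the flip-flop idea, assuming an END marker really does follow every BEGIN marker. The sample lines in @diff are made up for illustration; real input would be the split diff output.

```perl
use strict;
use warnings;

# Hypothetical diff lines standing in for the real @diff array.
my @diff = (
    '< set hostname "fw-old"',
    '< set private-key "-----BEGIN ENCRYPTED PRIVATE KEY-----',
    '< MIIFDjBABgkqhkiG9w0BBQ0wMzAbBgkqhkiG9w0BBQwwDgQI',
    '< -----END ENCRYPTED PRIVATE KEY-----"',
    '< set timeout 30',
);

my @kept;
LINE: for my $line (@diff) {
    # In scalar context .. is a flip-flop: it turns true on the BEGIN
    # line and stays true up to and including the END line.
    next LINE if ($line =~ /BEGIN ENCRYPTED/) .. ($line =~ /END ENCRYPTED/);
    push @kept, $line;
}

print "$_\n" for @kept;
```

Only the hostname and timeout lines survive. If a BEGIN line could appear in the diff without its END line, the flip-flop would stay on and swallow everything after it, so this is best applied to the files before diffing.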

Re: regex for identifying encrypted text
by hippo (Abbot) on May 16, 2018 at 10:28 UTC
    (1) Suggestions on how to construct such a regex?

    Very difficult to avoid false positives because the last line of the key may only be a few characters.

    (2) Suggestions on how to tackle the problem differently?

    Text::Diff will happily work on arrays/scalars as well as files, so pre-process your inputs to remove any private keys before doing the diff - that way they are easy to identify.
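
A rough sketch of that pre-processing step, using only core Perl. The helper name strip_private_keys, the placeholder text and the sample configs are all hypothetical, and the pattern assumes both markers are always present.

```perl
use strict;
use warnings;

# Hypothetical helper: replace each PEM-style encrypted key block with a
# fixed placeholder, so that configs differing only in key material
# compare as equal.
sub strip_private_keys {
    my ($text) = @_;
    $text =~ s/-----BEGIN ENCRYPTED PRIVATE KEY-----.*?-----END ENCRYPTED PRIVATE KEY-----/<key removed>/gs;
    return $text;
}

# Made-up config contents standing in for the two files.
my $old = <<'END_OLD';
set hostname "fw"
set private-key "-----BEGIN ENCRYPTED PRIVATE KEY-----
AAAA
BBBB
-----END ENCRYPTED PRIVATE KEY-----"
set timeout 30
END_OLD

my $new = <<'END_NEW';
set hostname "fw"
set private-key "-----BEGIN ENCRYPTED PRIVATE KEY-----
CCCC
DDDD
-----END ENCRYPTED PRIVATE KEY-----"
set timeout 30
END_NEW

my $old_clean = strip_private_keys($old);
my $new_clean = strip_private_keys($new);

print $old_clean eq $new_clean ? "no relevant changes\n" : "configs differ\n";
```

The cleaned strings can then be handed to Text::Diff by reference, along the lines of diff \$old_clean, \$new_clean, { STYLE => "OldStyle" }.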

Re: regex for identifying encrypted text
by QM (Parson) on May 16, 2018 at 10:18 UTC
    I can think of 2 options:

    1) Write a regex that recognizes the full encryption multiline blob. This will require modifying how lines are defined, etc.

    2) That's pretty much it, unless you want to have the odd false positive, and decide that any string longer than 30 chars without whitespace or certain punctuation is actually part of a key. And also, not catch certain fumble finger changes where some such string was accidentally introduced in a comment or other free-form, non-parsed section. (I'm assuming that such a change in code would fail to parse, and would be caught fairly quickly.)
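
To make the trade-off in option 2 concrete, here is a small sketch of such a heuristic. The threshold of 30 comes from the suggestion above; the character class (Base64 alphabet plus '=' padding) and the sample lines are assumptions.

```perl
use strict;
use warnings;

# Heuristic only: a run of 30 or more Base64-alphabet characters.
my $looks_random = qr{[A-Za-z0-9+/=]{30,}};

my @lines = (
    'MIIFDjBABgkqhkiG9w0BBQ0wMzAbBgkqhkiG9w0BBQwwDgQInXCep+2zzpgCAggA', # key line: matches
    'set hostname "core-router-01"',                                    # ordinary line: no match
    'pgsnYGVx',                                                         # short last key line: missed
    '# see /usr/local/share/vendor/configuration/templates/default',    # long path: false positive
);

my @flags = map { /$looks_random/ ? 1 : 0 } @lines;

for my $i (0 .. $#lines) {
    printf "%-8s %s\n", ($flags[$i] ? 'random?' : 'plain'), $lines[$i];
}
```

The short final line of a key slips through and the long path is flagged by mistake, which is why matching on the markers is the safer route.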

    Do you have a list of valid characters in private keys and such?

    Quantum Mechanics: The dreams stuff is made of

Re: regex for identifying encrypted text
by james28909 (Deacon) on May 16, 2018 at 16:44 UTC
    Can you give an example file that you parse? I would say you could just change the input record separator to paragraph mode eg  local $/ = ""; when you find "set private-key ", then read one more time with <DATA>. I am not sure if enc data ends with '[...]' or "\n\n" or '"' or what.

    please post an example file.
    #here is a small test i did with the examples given in OP
    use strict;
    use warnings;

    while(<DATA>){
        if ( /set private-key/){
            print "\nsetting input separator to paragraph mode\n";
            local $/ = ""; #record separator will change itself back to default AFTER leaving the if block
            print "skipping encrypted data\n";
            <DATA>;
            print "encrypted data skipped and should not be printed !\n\n";
            next;
        }
        print if /\w+/;
    }
    __DATA__
    random data
    more random data
    blah blah
    one last test BEFORE! encrypted data!
    set private-key "-----BEGIN ENCRYPTED PRIVATE KEY-----
    sdfkjghsdlkhfgldkfjghldkfjgh
    sdflkjgdfgl;kd;lfkgjdlfkgjd;l
    dlkjfghlkdfjghldskfjhgldskfjhg

    this is AFTER encrypted data
    more rand0m data
    this is the last test AFTER! encrypted data.
    The idea here is to read the file/s one line at a time until you find the desired string. then once you find that string, change the input record separator to stop at the end of the enc data (if possible). if enc data is always the same length, then you could find the desired string and then read that static length every time.
    EDIT: Cleaned up post... a little.

      To my reading, the OP implies that the set private key command occurs in a single line which also includes the cipher payload complete with both brackets.   In that case, I think that I would expand upon my previously-suggested regex to include the command, the space and double-quote, and the cipher header bracket ... then proceeding non-greedily to the corresponding footer-bracket and the final quote, knowing that thereby you have got it all.   Substitute an empty-string for the whole thing, and the problem is solved.   Multi-line logic would be only slightly more complicated, and all of it can be buried within a subroutine so that the rest of the program doesn’t have to deal with it.   Because the brackets are there and can be relied-upon, the OP does not have to come up with a “regex for identifying encrypted text.”

      Just my two cents ...

Re: regex for identifying encrypted text
by cavac (Deacon) on May 16, 2018 at 12:22 UTC

    So the problematic block of lines starts with set private-key ", then some junk that doesn't contain quotes, and then it ends with a quote, right?

    You seem to be parsing this line-by-line, correct? So, modifying your code to make a very rudimentary state machine, i'd guess something like this would do:

    my $isprivatekey = 0;
    LINE: for my $line (@diff) {
        if($isprivatekey) {
            if($line =~ /\"/) {
                # last line of private key
                $isprivatekey = 0;
            }
            next;
        }
        if($line =~ /set\ private\-key/) {
            # uh, get some useless stuff here
            $isprivatekey = 1;
            next;
        }
        next LINE if $line =~ /set password ENC/;
        [...]
    }

    This should skip the whole private key block altogether.

    "For me, programming in Perl is like my cooking. The result may not always taste nice, but it's quick, painless and it get's food on the table."

      The problem with this approach is that the line with "set private" is the same in both files and therefore will not be featured in the diff. The data between these markers must be removed before the diff is performed.

Re: regex for identifying encrypted text
by sundialsvc4 (Abbot) on May 16, 2018 at 15:21 UTC

    Very-fortunately, you don’t need to do this, and therefore ought not attempt it.   Cipher material always begins and ends with very distinctive strings such as -----BEGIN ENCRYPTED PRIVATE KEY-----, which always occur on their own line with nothing else on that line, specifically to facilitate either parsing-out or excluding that material.   The next marker will be the corresponding end-marker, also by itself on its own line.

    Encrypted content will also always be similarly bracketed.   You do not have to guess that it is “encrypted text.”   The markers will tell you.

    All of this is formally defined as RFC 5958 Asymmetric Key Packages.

    A regex such as /\-\-\-\-\-BEGIN ENCRYPTED PRIVATE KEY\-\-\-\-\-(.*?)\-\-\-\-\-END ENCRYPTED PRIVATE KEY\-\-\-\-\-/s will reliably match and capture the cipher material that may be found within a very large string.   The only gotcha in this case is that the pattern must be non-greedy (as shown), so that in a string containing many such blocks it will stop at the next occurrence of the end-marker instead of consuming everything up to the very last occurrence of that marker.   There is no need to look for the material itself, e.g. to somehow recognize Base64 encoding.   If a begin-marker occurs, you can rely on the fact that an end-marker will always be present and that the two will correspond.   Everything in-between the two markers will be nothing but cipher material of the specified kind, and you don’t have to consider how it might be encoded or structured.
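
A small sketch of the greedy versus non-greedy point, on made-up text containing two key blocks. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $begin = '-----BEGIN ENCRYPTED PRIVATE KEY-----';
my $end   = '-----END ENCRYPTED PRIVATE KEY-----';

# Made-up text: two key blocks with ordinary lines around them.
my $text = join "\n",
    'set hostname "fw"',
    $begin, 'AAAA', $end,
    'set timeout 30',
    $begin, 'BBBB', $end,
    'set mtu 1500',
    '';

# Non-greedy: each match stops at the nearest end marker.
my @lazy = $text =~ /\Q$begin\E(.*?)\Q$end\E/gs;

# Greedy: a single match runs from the first begin marker to the last
# end marker, taking the ordinary line in between with it.
my @greedy = $text =~ /\Q$begin\E(.*)\Q$end\E/gs;

printf "non-greedy: %d block(s)\n", scalar @lazy;
printf "greedy:     %d block(s)\n", scalar @greedy;
print  "greedy match swallowed 'set timeout 30'\n"
    if $greedy[0] =~ /set timeout 30/;
```

The \Q...\E quoting is just a shorter way of escaping the dashes than writing \- for each one.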

    (Depending on your exact regular-expression, you may need to be sure that you correctly consider what “newline” sequence is being used in the data.   See perldoc perlrebackslash.)

    If you need to process as a single string stuff that might contain cipher blocks that you don’t want, just use a s// regex to change the entire thing, markers and all, to an empty-string.   (Do it “globally” s///g if you want to get them all in one swell foop.)

    If you might be processing a file line-by-line but want to exclude cipher blocks, a simple “fetch the next line of data” subroutine could be devised which will detect key-block markers and loop over (and discard) the markers and data, returning the line (if any) which follows them.   You can use a simple eq here:   the marker will always begin in column #1, will always be uppercase-only ASCII, and will end with a newline immediately following the last dash.
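
One possible shape for such a subroutine, as a sketch only. The name next_plain_line is hypothetical, and the eq comparison assumes each marker sits alone on its line; in the OP's sample the begin marker follows set private-key ", so a regex match would be needed there instead.

```perl
use strict;
use warnings;

my $BEGIN = "-----BEGIN ENCRYPTED PRIVATE KEY-----\n";
my $END   = "-----END ENCRYPTED PRIVATE KEY-----\n";

# Hypothetical helper: return the next line from $fh that is not part
# of a key block, or undef at end of input.
sub next_plain_line {
    my ($fh) = @_;
    while (defined(my $line = <$fh>)) {
        return $line unless $line eq $BEGIN;
        # Inside a block: discard lines up to and including the end marker.
        while (defined($line = <$fh>)) {
            last if $line eq $END;
        }
    }
    return;
}

# Made-up input, read through an in-memory filehandle for the demo.
my $input = "set hostname fw\n" . $BEGIN . "AAAA\nBBBB\n" . $END . "set timeout 30\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";

my @kept;
while (defined(my $line = next_plain_line($fh))) {
    push @kept, $line;
}
close $fh;

print @kept;
```

The rest of the program only ever sees the two ordinary lines, so the key-skipping logic stays in one place.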

    If you need instead to collect the cipher material into a string, you should collect both markers as well as the lines in-between them, with \n newline characters in-between each as well as at the end.

      /\-\-\-\-\-BEGIN ENCRYPTED PRIVATE KEY\-\-\-\-\-(.*?)\-\-\-\-\-END ENCRYPTED PRIVATE KEY\-\-\-\-\-/ will reliably match and capture
      at least it will reliably compile. But . doesn't match \n unless you have the /s option for your regex. You were just one character away from code that actually works.
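
A quick way to see the difference, using a made-up three-line key block:

```perl
use strict;
use warnings;

my $text = "-----BEGIN ENCRYPTED PRIVATE KEY-----\n"
         . "AAAA\n"
         . "-----END ENCRYPTED PRIVATE KEY-----\n";

# Without /s the dot cannot cross the newlines between the markers.
my $without_s = $text =~ /BEGIN ENCRYPTED PRIVATE KEY-----(.*?)-----END ENCRYPTED PRIVATE KEY/  ? 1 : 0;

# With /s the dot matches any character, newline included.
my $with_s    = $text =~ /BEGIN ENCRYPTED PRIVATE KEY-----(.*?)-----END ENCRYPTED PRIVATE KEY/s ? 1 : 0;

print "without /s: ", ($without_s ? "match" : "no match"), "\n";
print "with /s:    ", ($with_s    ? "match" : "no match"), "\n";
```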

        “Well, foo!”   Edited the post.   Is that right now?   Or, if you prefer, post a correction.   Thanks.

        (Anyhow, mostly I was driving for the general idea:   that you can and should rely on the brackets and use them for this intended purpose.)
