
Re: Is it safe to use external strings for regexes?

by LanX (Sage)
on Oct 06, 2021 at 14:02 UTC


in reply to Is it safe to use external strings for regexes?

> My question is whether this is safe to do or not

I'm not sure whether you are asking if your own code is safe or if foreign regexes are "safe".

In the latter case, there are three issues I'm aware of:

  1. code injection by string interpolation, like /@{[ do_evil() ]}/
  2. code injection by regex, like /(?{ do_evil() })/
  3. exponential time regexes with excessive backtracking, something like /((x*)*)*/ IIRC

The first two cases might be solved by introspecting the pattern and blacklisting regex ops first; the third probably only by experimenting with a hard limit on runtime.
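A hard runtime limit for the third issue could be sketched like this. Note that a Perl-level $SIG{ALRM} handler alone is not enough here: Perl delivers such signals only between opcodes, and a single regex match is one long-running opcode. The sketch therefore runs the match in a forked child and lets the default SIGALRM action kill it. The helper name match_with_limit and its return values are my own invention, and it assumes a Unix-like system with fork.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical helper: match $string against an untrusted $pattern in a
# child process and give up after $seconds.
# Returns 'match', 'nomatch', 'timeout' or 'error'.
sub match_with_limit {
    my ($pattern, $string, $seconds) = @_;

    my $pid = fork();
    die "fork failed: $!" unless defined $pid;

    if ($pid == 0) {
        # Child: no Perl-level handler is installed, so the default
        # SIGALRM action terminates the process even in mid-match.
        alarm $seconds;
        my $re = eval { qr/$pattern/ };
        POSIX::_exit(2) unless defined $re;    # pattern did not compile
        POSIX::_exit($string =~ $re ? 0 : 1);
    }

    # Parent: wait for the child and decode its exit status.
    waitpid $pid, 0;
    my $signal = $? & 127;
    return 'timeout' if $signal == POSIX::SIGALRM();
    return 'error'   if $signal;
    my $code = $? >> 8;
    return $code == 0 ? 'match' : $code == 1 ? 'nomatch' : 'error';
}

print match_with_limit('^a+$', 'aaa', 5), "\n";
```

Because the pattern is compiled from a string without use re 'eval', a pattern containing a (?{ ... }) block ends up as 'error' rather than being executed.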

NB: it's even possible to "hide" a BEGIN block inside a regex. We had this discussion about 10 years ago; I'll add a link.

Edit: We have regularly had similar discussions over the years; you might want to Super Search the archives.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

updates

  - here --> Re: Vulnerabilities when editing untrusted code... (Komodo)
  - more at regex-explosive-quantifiers

Replies are listed 'Best First'.
Re^2: Is it safe to use external strings for regexes?
by dave_the_m (Monsignor) on Oct 07, 2021 at 08:23 UTC
    > In the latter case, there are three issues I'm aware of
    >   1. code injection by string interpolation, like /@{[ do_evil() ]}/
    >   2. code injection by regex, like /(?{ do_evil() })/
    >   3. exponential time regexes with excessive backtracking, something like /((x*)*)*/ IIRC

    String interpolation of variables only happens for literal regexes in the source code. So if the pattern is read from a file or database this isn't an issue.

    Embedded code within a pattern is only allowed within the scope of use re 'eval'; otherwise trying to compile such a regex from a string will die at run time.
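A minimal sketch of that second point, assuming the pattern arrives as a plain string (the variable $external is just a stand-in for something read from a file or database):

```perl
use strict;
use warnings;

# Stand-in for a pattern string read from a file or database.
my $external = '(?{ print "evil\n" })';

# Without "use re 'eval'" in scope, compiling this string dies
# instead of running the embedded code.
my $compiled = eval { qr/$external/ };
my $error    = $@;

print defined($compiled) ? "compiled\n" : "rejected: $error";
```

The error should mention that the eval-group is not allowed at runtime; the embedded print never runs.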

    The third one is a genuine issue, in terms of both CPU and memory usage.

    Dave.

      > So if the pattern is read from a file or database this isn't an issue.

      As I said "In the latter case" of general vulnerabilities, these are some issues to be aware of.

      The OP said

      > > These regexes are in the dozens, and are scattered across several scripts and libraries.

      > > maintenance of these mappings is easier.

      I doubt the general case can be solved with a DB of simple strings. Maintainable regexes are composed of smaller ones by interpolation and dynamic compilation, which brings us back to the start.

      > is only allowed within the scope of use re 'eval';

      with "newer" Perls yes. I noticed that you changed it around 2013, and am thankful for that. *

      > The third one is a genuine issue, in terms of both CPU and memory usage.

      Well, some regex engines sometimes optimize better than Perl's.

      I remember a demo of a case with nested quantifiers where Unix grep did very well and Perl waited until the end of time.

      This could be eased by analyzing the regex for potential traps like those listed here and warning accordingly.

      This analysis could be done by parsing the compilation output of use re 'debug';

      But again, this could open the door to those general vulnerabilities; that's why I prefer to point them out.
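As a much cruder stand-in for such an analysis, one could scan the pattern text itself. This is only a rough sketch of my own and not a parser of the re 'debug' output: the function name looks_explosive is hypothetical, and the check ignores escapes, character classes and deeper nesting, so expect both false positives and false negatives.

```perl
use strict;
use warnings;

# Hypothetical heuristic: flag a group that contains a * or + quantifier
# and is itself followed by * or +, e.g. (x*)* or (a+)+.
sub looks_explosive {
    my ($pattern) = @_;
    return $pattern =~ / \( [^()]* [*+] [^()]* \) [*+] /x ? 1 : 0;
}

for my $candidate ('((x*)*)*', 'foo(bar)+') {
    printf "%-12s %s\n", $candidate,
        looks_explosive($candidate) ? 'suspicious' : 'ok';
}
```

Since this only looks at the string and never compiles or runs the pattern, it does not itself open the door to the code-injection issues above.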

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      For completeness: TheDamian published a static parser for Perl regexes; I can't tell how closely it incorporates new features.

      *) Some IDEs run perl -c by default when they open a Perl file. A trojan script with an evil BEGIN block will execute instantly on opening. And obfuscation with Acme::EyeDrops will still allow hiding the evil logic in a regex; one just needs to add use re 'eval'; for newer Perls.

        > is only allowed within the scope of use re 'eval'; with "newer" Perls yes. I noticed that you changed it around 2013, and am thankful for that. *
        Um no, "use re 'eval'" has always been required to allow non-literal code blocks in patterns. The big "re eval" rewrite in 5.18.0 just made it smarter, so that for example a literal (and thus safe) code block could be interpolated into a run-time regex without needing the "use re 'eval'":
        use re 'eval'; # ** no longer needed from 5.18.0 onwards
        $r = qr/xyz/;
        /(?{ foo() })$r/;

        Dave.

Re^2: Is it safe to use external strings for regexes? (use Safe)
by LanX (Sage) on Oct 06, 2021 at 16:21 UTC
    FWIW: there is the Safe module to disallow certain Op-codes inside a (r)eval.

    use Safe;
    $compartment = new Safe;
    $compartment->permit(qw(time sort :browse));
    $result = $compartment->reval($unsafe_code);
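To show what a trapped op looks like in practice, here is a small sketch using only the default op mask (the code strings passed to reval are made up for the demo):

```perl
use strict;
use warnings;
use Safe;

# A fresh compartment permits only the :default set of ops.
my $compartment = Safe->new;

# Plain arithmetic is allowed.
my $sum = $compartment->reval('1 + 2');
print "sum: $sum\n";

# system is not in the default set, so compilation is refused
# and the reason ends up in $@.
my $out  = $compartment->reval('system("echo hi")');
my $trap = $@;
print "blocked: $trap" if $trap;
```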

    Unfortunately I couldn't find a way to disable compile-time blocks like BEGIN, and there doesn't seem to be another way to disable or override BEGIN...

    I'd love to be corrected.

    UPDATE

    oh Keyword::Simple could do the trick :)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      It's indeed possible to bend the parser so that it thinks BEGIN and family are subs:

      use strict;
      use warnings;

      use Keyword::Simple;

      sub no_begin ($&){
          warn "no_begin(@_) called";
      }

      my @code;

      BEGIN{
          my @compile_blocks = qw(BEGIN UNITCHECK CHECK INIT END);
          for my $block (@compile_blocks) {
              # bend parser
              Keyword::Simple::define $block, sub {
                  my ($ref) = @_;
                  substr($$ref, 0, 0) = "no_begin '$block', sub";
              };
              # test code
              push @code , <<__CODE__;
      $block { die "owened by $block" }
      __CODE__
          }
      }

      BEGIN { die "owened by BEGIN" };
      UNITCHECK { die "owened by UNITCHECK" };
      CHECK { die "owened by CHECK" };
      INIT { die "owened by INIT" };
      END { die "owened by END" };

      eval join "\n", @code;

      -*- mode: compilation; default-directory: "d:/tmp/pm/" -*-
      Compilation started at Wed Oct 6 19:04:01

      C:/Strawberry/perl/bin\perl.exe -w d:/tmp/pm/KW_simple_regex_BEGIN.pl
      no_begin(BEGIN CODE(0x694268)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
      no_begin(UNITCHECK CODE(0x6556d0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
      no_begin(CHECK CODE(0x6942b0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
      no_begin(INIT CODE(0x6c8aa0)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
      no_begin(END CODE(0x6c8c38)) called at d:/tmp/pm/KW_simple_regex_BEGIN.pl line 6.
      Bareword found where operator expected at (eval 5) line 5, near "} CHECK"
          (Missing operator before CHECK?)

      Unfortunately, eval'ing the code no longer catches parse errors... (the reason here: BEGIN {} blocks don't need a trailing semicolon, but the sub calls that replace them do)

      so the answer is:

      • Yes BEGIN* blocks can be disabled.
      • But this is best done in an extra process

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

Re^2: Is it safe to use external strings for regexes?
by Anonymous Monk on Oct 06, 2021 at 22:50 UTC
      I also took the meat of this question to be about accepting user input, then throwing that into a regex - only to say, don't trust user input directly - as always. There's only one mention of taint in this whole thread, and I am replying to it. :-)
