Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

What can regular expressions NOT do?

by kryten (Scribe)
on Apr 23, 2000 at 02:16 UTC ( #8624=perlquestion: print w/replies, xml ) Need Help??

kryten has asked for the wisdom of the Perl Monks concerning the following question: (regular expressions)

It seems like perl regexps can match pretty much anything, and can even be used for some mathematical trickery So what is it that they cannot do? Is there anything that is actually impossible to match?

Originally posted as a Categorized Question.

Replies are listed 'Best First'.
Re: What can regular expressions NOT do?
by turnstep (Parson) on Apr 25, 2000 at 23:35 UTC
    Regexps are bad at very complicated cases, especially when things are nested. The typical example is HTML: it's easy enough to make a simple regexp to grab tags, but to get it really right and account for all cases (comments, greater than signs within quotes within HTML tags, multiline tags, etc.) you outgrow the usefullness of a regular expression and really need something that will actually parse the string (or file, whatever) that contains the HTML. Luckily, there are modules that will do just that for you.
      Wipe your ass**** after you're done shxxing
Re: What can regular expressions NOT do?
by buzzcutbuddha (Chaplain) on Apr 25, 2000 at 15:18 UTC
    The question is not so much 'What Can't Regexp Do?', but when shouldn't you use it.
    There are times when processing strings where it is easier and faster to use a substr
    to grab what you need from input. This is especially true if you know that the length
    of the input elements does not change (ie a fixed-width flat text file...). I guess the
    analogy would be that if you wanted to dig postholes for a fence, you could use a shovel,
    which is powerful and has many different ways to use, but is not as efficient as a posthole
    digger. Don't get me wrong, I love regexps, but sometimes, they are not the best tool for
    the job. :)
Re: What can regular expressions NOT do?
by kryten (Scribe) on Apr 25, 2000 at 18:45 UTC
    It's just like the old adage:
    If all you have is a regexp, then everything looks like a /^[Nn]ail$/
    :)
Re: What can regular expressions NOT do?
by jbn (Initiate) on Apr 27, 2000 at 01:48 UTC
    you cannot write a regexp that will match any level of nested balanced parentheses. You can always write one to match a specific number of balanced ones, but not one that can match a non- fixed number of these. I think there is an explaination in the Friedl book.
      Let me see if I can make this make sense with digging out a text book...

      Regular expressions are implemented internally by a 'finite state automata'.

      If you've never heard the term, I'll attempt to explain it... Picture a group of circles interconnected by various lines. The circles represent the current state, and the lines represent the next state to go to if a give input is seen. One circle is the start state, and some other number of circles are 'end' states (you can have more than one).

      For a specific example try this:
      take three circles, label them 'start', '1', and 'end'
      draw an arrow from 'start' to '1', from '1' to 'end' and from 'end' to '1'
      label the arrows 'a','b' and 'c' respectively.

      Starting at 'start', take each character of input and follow the link with that label to the next state.
      If at the end of the input you're at the 'end' state, the this automata matches the input.
      If you have a character of input that you don't have a link for from you're current state, or you run out of input and aren't on a 'end' state, the the automata doesn't match the input.
      Using our example the following will match: 'abc','abcbc','abcbcbc'.
      And these will not: 'a', 'ab','abcb','abcd','ad', etc.
      (The regular expression for this automata would be /^a(bc)+$/)

      At each step the only thing the automata is concerned with is what state it is in, and what is the next character of input. The is no retained knowledge of what the previous characters were. Since finite state automatas have no 'memory' of what input they've seen before, they have no way of knowing if the correct number of ')' has been found.

      Hopefully that made sense, but was probably FMTYWTK
      /\/\averick

        Well, I actually asked for something similar ages ago on comp.lang.perl.moderated (I wanted to match balanced braces in C++ code) and was treated to the following solution which works quite handily on arbitrary-depth nesting for parens.

        $string = "((()()())"; # one unbalanced paren ($re = "\Q$string") =~ s/\\(\()|\\(\))/$1\\$1$2$2/g; my @a = eval { $string =~ $re }; die "Mismatched brackets in '$string'\n" if $@;

        -J.

Re: What can regular expressions NOT do?
by ton (Friar) on Apr 03, 2001 at 23:23 UTC
    Also cannot solve the "dangling else" problem. That is, if you're checking a piece of code with a bunch of "if" statements which may or may not be followed by "else" statements, it is impossible to write a regexp that will find all occurances of "elses" without matching "ifs" (e.g. the point at which the number of elses exceeds the number of ifs). Note that this is trivially easy to do with a loop.

    There are a bunch more things also; have to dig up my compiler books to list them, though :)

Re: What can regular expressions NOT do?
by I0 (Priest) on May 20, 2002 at 20:51 UTC
    But Perl's m// operator is a more expressive grammar than formal regular expressions.
    For example, \1 allows you to detect primes:
    print "Prime" if (1 x shift) !~ /^1?$|^(11+?)\1+$/
    And s/// can be used to detect balanced parenthesis:
    $_ = '(()(()()))'; (my $re=$_)=~s/((\()|(\))|.)/${[')','']}[!$3]\Q$1\E${['(','']}[!$2]/gs +; eval{/$re/}; print "Balanced" unless $@ =~ /unmatched/;
Re: What can regular expressions NOT do?
by Anonymous Monk on May 20, 2002 at 05:16 UTC
    It is because regular expressions merely have the power of finite state machines. These finite state machines have NO memory.... how could it match equal numbers of an indeterminate quantity without MEMORY. It can't. That's why you would need the power of a more expressive grammar ... that is that which can be expressed by a pushdown automota. With a pushdown, we have a "stack" of memory that we can save about each state. Clearly, you can see how the matched parenthesis problem becomes very simple with such a mechanism at hand.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://8624]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others contemplating the Monastery: (2)
As of 2021-09-27 22:23 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?