PerlMonks  

Perl regex in real life

by RezaRob (Initiate)
on Oct 31, 2008 at 07:04 UTC (#720635=perlquestion)

RezaRob has asked for the wisdom of the Perl Monks concerning the following question:

Dear Perl Monks, can you give some _real-life_ examples where "Perl-specific" regex features (like backreferences or non-greedy matching, perhaps) would be indispensable? Thanks a lot. Reza.

Replies are listed 'Best First'.
Re: Perl regex in real life
by moritz (Cardinal) on Oct 31, 2008 at 07:15 UTC
    Many languages come with regular expression features similar to Perl, so I wouldn't call backreferences or non-greedy quantifiers "Perl-specific".

    Anyway, there are lots of the "more advanced" features (the ones you named, look-ahead and look-behind assertions, non-backtracking groups...) that I use now and then in regex that ease my work. I just don't think of them as being special, so I don't remember the applications.

    In "Mastering Regular Expressions" there's a nice example: a regex for detecting accidentally duplicated words words in text. For that you need backreferences (something like m/\b(\w+)\s+\1\b/).
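A minimal sketch of the duplicated-word check, built around the pattern quoted above. The helper name doubled_words and the sample sentence are made up for illustration; the match is case-sensitive, so "The the" would need /i.

```perl
use strict;
use warnings;

# Return every word that is immediately followed by a copy of itself.
# \b(\w+) captures a whole word, \s+ skips the whitespace between the words,
# and \1\b requires the captured text to appear again as a whole word.
sub doubled_words {
    my ($text) = @_;
    my @found;
    while ($text =~ /\b(\w+)\s+\1\b/g) {
        push @found, $1;
    }
    return @found;
}

print join(", ", doubled_words("detecting duplicated words words in the the text")), "\n";
# prints "words, the"
```

Without the backreference there is no way to say "the same word again" inside a single pattern; you would have to split the text and compare neighbours in ordinary code.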

    There are other features that are more specific to Perl, like the "keep assertion" \K (new in Perl 5.10, see perlre for details). That's very useful when you want to delete everything after a regex match: s/$regex\K.*//s
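To make the \K idiom concrete, here is a small sketch that wraps the substitution from the post in a helper. The name truncate_after and the sample strings are assumptions for illustration only.

```perl
use strict;
use warnings;
use 5.010;   # \K needs Perl 5.10 or later

# Delete everything that follows the first match of $regex, keeping the match.
# \K tells the engine to "keep" what was matched so far, so only the
# trailing .* is replaced by the empty string.
sub truncate_after {
    my ($text, $regex) = @_;
    (my $copy = $text) =~ s/$regex\K.*//s;
    return $copy;
}

print truncate_after("key=value # trailing comment\nmore lines", qr/value/), "\n";
# prints "key=value"
```

The /s modifier lets .* run across newlines, which is why the second line disappears as well. If the pattern does not match, the text comes back unchanged.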

Re: Perl regex in real life
by smiffy (Pilgrim) on Oct 31, 2008 at 09:39 UTC

    I'm not sure that I fully understand your question, but would point out that the pcre (Perl Compatible Regular Expressions) C library is used as part of PHP for preg_match(), preg_replace(). (PHP aping its betters again ;-))

    pcre's are also used in applications such as the Postfix mail system. Part of my own anti-spam solution for my Postfix mail servers uses pcre's for message header and body checks. For reference, here are my current bodychecks and headerchecks files. (For anyone thinking that my regex syntax is weird, I've done them like this so that I can read them easily.)

      Thanks for posting your filters. I guess what I mean is that apparently in real-life backtracking and related features aren't required. In fact, I find them a source of distraction and bugs in my regexes. For instance, if you want to parse C source code, you might say something like:
      /((?:const\s+)*)\s+(\w+)\s+(\w+);/
      Now suppose the C code has a bug, and it defines a variable like this:
      const int;

      Then your three regex captures look like:
      $1==''
      $2=='const'
      $3=='int'

      which clearly wasn't the intended result.

      This happens because of backtracking for instance.
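The capture behaviour described here can be checked with a short sketch. Two caveats, both assumptions on my part: as posted, the pattern requires whitespace after the optional const group and does not appear to match "const int;" at all, so the sketch also tries a loosened variant (\s* instead of the first \s+, anchored with ^), which does produce the captures listed above. The third variant wraps the const group in (?>...) to show what forbidding backtracking changes. The helper name captures is made up.

```perl
use strict;
use warnings;

# Return the three captures as a list, or an empty list if there is no match.
sub captures {
    my ($re, $code) = @_;
    return $code =~ $re ? ($1, $2, $3) : ();
}

my $as_posted = qr/((?:const\s+)*)\s+(\w+)\s+(\w+);/;        # pattern from the post
my $loose     = qr/^((?:const\s+)*)\s*(\w+)\s+(\w+);/;       # first \s+ relaxed to \s*
my $atomic    = qr/^((?>(?:const\s+)*))\s*(\w+)\s+(\w+);/;   # const group may not backtrack

for my $re ($as_posted, $loose, $atomic) {
    my @c = captures($re, "const int;");
    print(@c ? "[" . join("|", @c) . "]" : "no match", "\n");
}
# prints:
#   no match
#   [|const|int]
#   no match
```

With the atomic group the buggy declaration is rejected instead of being misread, while a well-formed line such as "const int x;" still yields 'const ', 'int' and 'x'.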

      If you look at regex history, people had NFA/DFA machines from computer science theory, and they said a "language" is whatever passes through these machines. So that's what regexes matched. Nowadays, Perl has actually introduced things like non-backtracking constructs, namely '(?>'; essentially, acknowledging the "problem". However, I'm not quite sure why non-backtracking shouldn't be all you need in a "real-life" situation.

      Thanks a lot for everyone's comments.

      Reza.

        Neither backtracking nor captures are really a problem needing to be fixed. Features like (?>) were added to help extend the sorts of things regular expressions can do.

        To take your example, regular expressions are not the reason the wrong thing was matched. Your expression allowed that interpretation (or it would have with some minor changes). The problem is that you are using an expression that does not properly cover the case you claim to be looking for.

        To take another example that I think shows why these features are more useful, let's match a US telephone number. A telephone number in the US can take many forms:

        • 445-7890
        • 445 7890
        • 4457890
        • 713 445-7890
        • (713) 445-7890
        • 713-445-7890
        • 7134457890
        • 713 445 7890

        And that leaves out adding a 1 or 0 for long distance and extensions, which people often give as part of the number.

        Matching this set of expressions requires optional characters which (if you are doing captures) requires backtracking. (Not really, but the implementation gets hairier if we discuss that part.)

        So to match a phone number, we would need:

        m{ ( (?: \( \d\d\d \) \s* | \d\d\d (?: - | \s* ) )? \d\d\d (?: - | \s* ) \d\d\d\d ) }x;

        Obviously, this appears somewhat complicated and there is quite a bit of possibility for confusion. In this case, however, the problem is not the regex, it's the fact that the phone number format is specified fairly sloppily.
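As a rough check of the phone-number idea, the sketch below runs an anchored variant of the pattern over the eight forms listed above. This is one possible spelling with balanced parentheses, not necessarily the exact pattern the author had in mind; the \A and \z anchors and the helper name is_us_phone are my additions.

```perl
use strict;
use warnings;

my $phone = qr{
    \A
    (?: \( \d{3} \) \s*        # area code in parentheses, optional spaces after
      | \d{3} (?: - | \s* )    # or a bare area code with an optional separator
    )?                         # the area code as a whole is optional
    \d{3} (?: - | \s* )        # exchange, then an optional separator
    \d{4}                      # last four digits
    \z
}x;

# Hypothetical helper: 1 if the whole string looks like one of the listed forms.
sub is_us_phone { return $_[0] =~ $phone ? 1 : 0 }

for my $n ('445-7890', '445 7890', '4457890', '713 445-7890',
           '(713) 445-7890', '713-445-7890', '7134457890', '713 445 7890') {
    printf "%-16s %s\n", $n, is_us_phone($n) ? "ok" : "no match";
}
```

Note how "4457890" only matches after the engine first tries "445" as an area code, fails, and backs off: the optional area code is exactly the kind of construct that relies on backtracking.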

        In fact, I have most often found the features you are questioning useful when dealing with real-world data. Because unlike the stuff (insert pompous tone) I generate, the real world is messy and inconsistent.

        One of the nastiest problems I ever tried to solve was to extract tables of information from text files generated by people at various companies. You have no idea how many weird variations that people can come up with that a person can interpret, but are almost unparseable by computer. Without many of these features, we would not have gotten as far as we did.

        G. Wade
        You see, in the old days, a regex only returned a boolean: match or not-match. Later, people got serious about actually capturing the matched sections, and they introduced backtracking (versus NFA machines) to get that sort of thing done.

        But then, the actual _algorithm_ and internals of what's happening becomes crucial. You now care about more than just a theoretical boolean result: match or not-match.

        When internally it matters what the engine is doing, backtracking makes it very hard and unnatural to predict and to think about. That's not how "real" humans actually think about their mother tongue. They don't seem to seriously "backtrack" in their brains while reading a book.

        So, are there real-life examples of where that's needed?

        Reza.
Re: Perl regex in real life
by JavaFan (Canon) on Oct 31, 2008 at 09:50 UTC
    Well, I use regular expressions a lot, but I wouldn't call Perl-specific regex features "indispensable". If I wasn't using Perl, I'd still be solving the same problems, just using a different language. So, I can't even call Perl itself indispensable, let alone a feature of a feature of the language.
Re: Perl regex in real life
by blazar (Canon) on Nov 01, 2008 at 10:46 UTC

    I personally believe that we may have different interpretations of the expression "real life": for example, a real-life problem I've often had to face is that of dating girls. In this respect, none of the "Perl-specific" regex features ever helped me, and neither did [Pp]erl as a whole, although I can imagine some convoluted and highly improbable situations in which they would. Now, due to my severe health conditions, I am coping with intense pain, and in the very instants I'm writing this, especially in my left leg (to the point that it's very hard to write at all); needless to say, those things do not help in any way either.

    So, my point is that there is no such thing as a "real life" IT/CS problem, which appears to be what you're referring to. There are actual problems, period. And there's that Turing thingie which I'm not repeating here, which states that if your programming language is powerful enough, then it will be as able as any other one to solve them. Then there are tools, and there are tools which are similar to others, but slightly more or slightly less powerful than those others. Thus you may have a situation in which some particular enhanced regex feature of Perl would not be "indispensable", but its absence would make your regex, say, ten times longer and several times more error-prone: how indispensable would that feature be for you?

    To only take into account the simplest of examples, since you talk about non-greedy matching, see my recent Insane (?) Regexp-based jpeg (JFIF) extractor... which uses the extremely simple /(\xFF\xD8 .*? \xFF\xD9)/xsg regex: how would you have done that without non-greedy matching? Please note that however ridiculous, the problem I was trying to solve with that program was a real world one in your sense, namely that of "extracting those images from that file ASAP."
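A toy illustration of why the non-greedy quantifier matters for this pattern. The byte string below is invented (two tiny fake "images" between filler bytes), and the helper names are hypothetical; only the lazy regex itself comes from the post.

```perl
use strict;
use warnings;

# \xFF\xD8 marks the start of a JPEG image, \xFF\xD9 its end.
my $blob = "junk\xFF\xD8AAA\xFF\xD9more\nfiller\xFF\xD8BBBB\xFF\xD9tail";

# In list context a /g match returns every captured chunk.
sub extract_lazy   { my ($data) = @_; return $data =~ /(\xFF\xD8 .*? \xFF\xD9)/xsg }
sub extract_greedy { my ($data) = @_; return $data =~ /(\xFF\xD8 .*  \xFF\xD9)/xsg }

my @lazy   = extract_lazy($blob);
my @greedy = extract_greedy($blob);

printf "lazy: %d chunk(s), greedy: %d chunk(s)\n", scalar @lazy, scalar @greedy;
# prints "lazy: 2 chunk(s), greedy: 1 chunk(s)"
```

The greedy version swallows everything from the first start marker to the last end marker, so both images and the filler between them come back as a single chunk.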

    On to the future: one oft repeated mantra at the Monastery and elsewhere is that regexen are not well suited to parse HTML and XML, which is perfectly true. But under Perl 6 they will be so enhanced as to be promoted to "rules" which in turn, it is said, will be perfectly apt at implementing real parsers (although I'm not an expert, and not even a beginner, in parsing theory and I don't know which kind of parsers...) Ain't this a real-life problem?

    --
    If you can't understand the incipit, then please check the IPB Campaign.
      /(\xFF\xD8 .*? \xFF\xD9)/
      Like a great friend of mine, you do talk a lot off track!

      But your example is absolutely correct :-) This _is_ backtracking. However, it is exactly equivalent to
      /(\xFF\xD8 .*? (?>\xFF\xD9))/
      So, in a sense, it isn't _actually_ backtracking. Complicated backtracking would be when you cannot do that.
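The equivalence claimed here can be spot-checked on sample data. This sketch only compares results on one invented byte string, which is evidence rather than proof; the second half uses the well-known a*ab case to show an atomic group that does change the outcome.

```perl
use strict;
use warnings;

my $data = "x\xFF\xD8abc\xFF\xD9y\xFF\xD8de\xFF\xD9z";

# A fixed literal can only match one way, so making it atomic changes nothing.
my @plain  = $data =~ /(\xFF\xD8 .*? \xFF\xD9)/xsg;
my @atomic = $data =~ /(\xFF\xD8 .*? (?>\xFF\xD9))/xsg;
print "same chunks\n" if @plain == @atomic && "@plain" eq "@atomic";

# Around a quantified subpattern the atomic group matters:
# (?>a*) grabs every "a" and refuses to give one back for the following "ab".
print "plain a*ab matches\n"        if     "aaab" =~ /a*ab/;
print "atomic (?>a*)ab fails\n"     unless "aaab" =~ /(?>a*)ab/;
```

So the atomic wrapper around the end marker is harmless, but the lazy .*? in front of it is still extended step by step by the engine, which is where the real work happens.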

      On to the future: one oft repeated mantra at the Monastery and elsewhere is that regexen are not well suited to parse HTML and XML, which is perfectly true. But under Perl 6 they will be so enhanced as to be promoted to "rules" which in turn, it is said, will be perfectly apt at implementing real parsers (although I'm not an expert, and not even a beginner, in parsing theory and I don't know which kind of parsers...) Ain't this a real-life problem?
      Indeed, parsers are real life problems. And I did realize that in PerlFAQ there is an entry:
      Can I use Perl regular expressions to match balanced text?

      That's "real life" enough.
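For the balanced-text case, one way to sketch it is with the recursive subpatterns available since Perl 5.10. This is an illustration under that assumption, not a summary of what the FAQ entry recommends; the helper name is_balanced is made up, and only round parentheses are handled.

```perl
use strict;
use warnings;
use 5.010;   # recursive subpatterns such as (?1) need Perl 5.10 or later

# Matches a string that consists of exactly one balanced, possibly nested,
# parenthesised group.
my $balanced = qr/
    \A
    (                  # group 1: one parenthesised unit
        \(
        (?: [^()]++    #   a run of non-parenthesis characters, never given back
          | (?1)       #   or a nested unit, by recursing into group 1
        )*
        \)
    )
    \z
/x;

sub is_balanced { return $_[0] =~ $balanced ? 1 : 0 }

print is_balanced("(a(b)c)"), is_balanced("(a(b c)"), "\n";
# prints "10"
```

Note that the possessive ++ is itself a non-backtracking construct, used here to keep failing matches from exploring pointless alternatives.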

      You'll realize, though, that when you're parsing (complicated) languages, you'll need to do things like error checking. So, in practice, either you'll need to make the regex inhumanly complex, or it still must contain "outlets" in the form of function callbacks etc. in order to handle special conditions and the logic and semantics of the language that you're parsing.

      I can see that the regex features mentioned above are useful, but I still don't feel very cozy about their beauty and simplicity. It _is_ clear, however, that Perl will be at least as useful as (and better than) traditional parsers (yet probably not quite as fast, unless you really understand internal implementation issues).

      Thanks a lot for bringing up these thoughts.

      Reza.
