Deriving Regular Expressions
by mojotoad (Monsignor) on Apr 18, 2002 at 21:19 UTC
I recently had cause to derive sets of regular expressions from a set of raw data and then use those regexps to extract a subset of the data -- sort of a uniq on steroids, if you will.
As usual there were some surprising complications along the way. In this case it was my ignorance about how the quotemeta() function actually works. This is the same function Perl applies when you escape meta-characters in double-quoted strings with the \Q and \E delimiters.
For the record, the documentation for quotemeta() reads, in part:
Returns the value of EXPR with all non-"word" characters backslashed. (That is, all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings.) This is the internal function implementing the \Q escape in double-quoted strings.
(For more detail, see "Gory details of parsing quoted constructs" in perlop.)
Now consider the following problem. Derive patterns, at varying degrees of generality, that would match the following string (without the quotes): ' ab+cd (12)34'
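To make the documented behaviour concrete, here is a minimal sketch that runs quotemeta() and the \Q...\E form over the same text. The variable names are just for illustration.

```perl
use strict;
use warnings;

my $raw = ' ab+cd (12)34';

# Every character outside [A-Za-z_0-9] gets a backslash, spaces included.
my $escaped = quotemeta($raw);

# \Q...\E inside a double-quoted string goes through the same escaping.
my $interpolated = "\Q$raw\E";

print "$escaped\n";    # \ ab\+cd\ \(12\)34
print $escaped eq $interpolated ? "same\n" : "different\n";    # same
```

Note that the two spaces come back as backslash-space, not as plain spaces.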
Here are some answers, in order of increasing generality:
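The original list of answers did not survive in this copy of the post. As a stand-in, the following sketch shows what such a progression might look like. These particular patterns are my own illustration, not necessarily the ones first posted.

```perl
use strict;
use warnings;

my $str = ' ab+cd (12)34';

# Illustrative patterns only, ordered from most specific to most general.
my @patterns = (
    qr/\A ab\+cd \(12\)34\z/,                         # exact literal text
    qr/\A\s+ab\+cd\s+\(12\)34\z/,                     # any amount of whitespace
    qr/\A\s+[a-z]{2}\+[a-z]{2}\s+\(\d{2}\)\d{2}\z/,   # keep run lengths, generalize characters
    qr/\A\s+[a-z]+\+[a-z]+\s+\(\d+\)\d+\z/,           # runs of any length
    qr/\A\s+\S+\s+\S+\z/,                             # two whitespace-separated tokens
);

for my $re (@patterns) {
    printf "%-50s %s\n", $re, ($str =~ $re ? 'matches' : 'does not match');
}
```

Every pattern matches the sample string; each later one also accepts a wider family of similar strings.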
In my case, at one point I wanted to preserve the lengths of alphanumeric sequences while collapsing whitespace. However, the strings were likely to contain special characters, like the '+' and parentheses in the example above, so I escaped those first to avoid having to escape meta-characters later in the process. Unbeknownst to me, quotemeta() escapes all non-word characters, and that includes whitespace.
My first quick approach looked like the following. Order is important here: the replacements for digits and whitespace, sequences like \d and \s, themselves contain letters, so if we replaced digits and spaces first they would be mangled by our alpha-character replacement later:
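The original listing is missing from this copy, so what follows is a hedged reconstruction of that kind of approach rather than the exact code. The function name naive_pattern and the specific replacement classes are assumptions.

```perl
use strict;
use warnings;

# Hypothetical reconstruction: escape everything first, generalize afterwards.
sub naive_pattern {
    my ($str) = @_;
    my $pat = quotemeta($str);        # escapes every non-word character, spaces included
    $pat =~ s/[a-zA-Z]/[a-zA-Z]/g;    # letters first: one class per letter keeps run lengths
    $pat =~ s/[0-9]/\\d/g;            # then digits: one \d per digit
    $pat =~ s/\s+/\\s+/g;             # finally, try to collapse whitespace
    return $pat;
}

my $str = ' ab+cd (12)34';
my $pat = naive_pattern($str);

print "$pat\n";
# \\s+[a-zA-Z][a-zA-Z]\+[a-zA-Z][a-zA-Z]\\s+\(\d\d\)\d\d

print $str =~ /\A$pat\z/ ? "match\n" : "no match\n";    # no match
```

The whitespace step finds each space with a backslash already in front of it, so the result begins with an escaped backslash followed by s+ instead of \s+.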
Oops. What happened there? All of the whitespace clusters are now a literal backslash followed by one or more 's' characters. Not only that but there are more of them than there should be. That won't do.
Had I read the quotemeta documentation, I would have known that when I first escaped the string, each space was escaped individually, since spaces aren't word characters. Drat.
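A small sketch of the effect on a run of spaces, using a made-up input string ('a   b', three spaces) chosen only to make the duplication visible:

```perl
use strict;
use warnings;

# Hypothetical input with a run of three spaces.
my $quoted = quotemeta('a   b');
print "$quoted\n";     # a\ \ \ b

# The backslashes split the run, so \s+ only ever sees one space at a time.
(my $mangled = $quoted) =~ s/\s+/\\s+/g;
print "$mangled\n";    # a\\s+\\s+\\s+b
```

One cluster of three spaces turns into three separate fragments, each of which now reads as a literal backslash followed by one or more 's' characters.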
Hence the solution that worked for me:
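The listing that belongs here is also missing from this copy. One way to get the described result, sketched under the assumption that only genuine regexp meta-characters should be escaped, is shown below. The function name derive_pattern and the exact character class are mine.

```perl
use strict;
use warnings;

# Sketch: escape only regexp meta-characters, leaving whitespace untouched.
sub derive_pattern {
    my ($str) = @_;
    my $pat = $str;
    $pat =~ s/([\\^\$.|?*+()\[\]{}])/\\$1/g;   # backslash the meta-characters only
    $pat =~ s/[a-zA-Z]/[a-zA-Z]/g;             # letters first, one class per letter
    $pat =~ s/[0-9]/\\d/g;                     # then digits, one \d per digit
    $pat =~ s/\s+/\\s+/g;                      # whitespace runs are still intact here
    return $pat;
}

my $str = ' ab+cd (12)34';
my $pat = derive_pattern($str);

print "$pat\n";
# \s+[a-zA-Z][a-zA-Z]\+[a-zA-Z][a-zA-Z]\s+\(\d\d\)\d\d

print $str =~ /\A$pat\z/ ? "match\n" : "no match\n";    # match
```

Because the spaces are never escaped, each whitespace cluster collapses to a single \s+, and the derived pattern also matches variants of the string with different amounts of whitespace.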
Much better. That's what I was looking for. Applied to the original data this new pattern would match, whereas the other would not.
This same problem applies to any non-word characters in a string. In this case it happened to be whitespace. The crux of the problem is this: "escape all special characters in what is to eventually become a regular expression -- if a character is normally interpreted as literal in a regexp then do not escape it." Are there more effective ways of deriving regexp patterns out there? I'm interested in hearing about them.