Re: [Try-out] Regexp do's and don'ts
by Dietz (Curate) on Aug 15, 2004 at 09:56 UTC
|
5. Do know what your regex really means.
• Do know about ^, $, variable interpolation, the matching rules as described by the Camel Book, modifiers, and the meaning of \n (newline) in combination with ., ^ and $
Do know about precedence, since disregarding it is one of the highest crimes in regex country:
I made a common mistake by not paying attention to precedence,
and I'll use this as a chance to pillory myself by giving a perfect example of what not to do.
In node Re: Short or Long Hand I used an alternation combined with anchors:
/^0|6$/
This simply says: match 0 at the beginning, or match 6 at the end,
while my intention was to match a 0 or a 6 spanning the whole string, from beginning to end:
/^(?:0|6)$/
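A quick way to see the difference is to run both patterns over a handful of inputs. This is only a sketch; the sample strings are invented for illustration and do not come from the node in question:

```perl
use strict;
use warnings;

# Alternation binds more loosely than the anchors, so this reads as
# (^0) or (6$): "starts with 0" OR "ends with 6".
my $loose = qr/^0|6$/;

# Grouping confines the alternation, so both anchors apply to each branch.
my $strict = qr/^(?:0|6)$/;

# Sample inputs are illustrative only.
my @samples = qw(0 6 06 16 0x 60);

my (%loose_hit, %strict_hit);
for my $s (@samples) {
    $loose_hit{$s}  = $s =~ $loose  ? 1 : 0;
    $strict_hit{$s} = $s =~ $strict ? 1 : 0;
    printf "%-3s loose=%d strict=%d\n", $s, $loose_hit{$s}, $strict_hit{$s};
}
```

The ungrouped pattern accepts "06", "16" and "0x", which the grouped one rejects; both reject "60".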
As an addition to your tutorial I'd like to see the basic requirement for regexes:
• You can't write an efficient regex as long as you don't know what your expected data will be:
Always think of the expected data while changing or simplifying regexes.
I personally like the term 'regexpected'
Keep it in mind and it will save your life ;-)
Please feel free to downvote node Re: Short or Long Hand
/me castigating myself for not paying attention to precedence
Re: [Try-out] Regexp do's and don'ts
by gumpu (Friar) on Aug 15, 2004 at 09:37 UTC
|
Hoi,
Good stuff! Has the potential to be very useful for newbies.
One point:
"5. Don't use regexes for formats without a definite syntaxis, like human language. Regexes are good for pattern matching (and substitution), not for langague analyzing."
Think you have to be a bit more specific here.
Computer languages have a definite syntax, but using regular expressions to parse them can be a very painful process. (I know this from experience because I once tried to make a code beautifier for C++ and Pascal.) In those cases a proper parser (say Parse::RecDescent) is much better.
If you are just trying to find simple things in source code, for instance #include statements or simply formatted comment blocks, regular expressions would be fine.
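As a rough sketch of the "simple things" case, the following pulls header names out of #include lines. The sample source lines and the pattern itself are assumptions made for illustration; the pattern knows nothing about comments, macros or line continuations, which is exactly where a real parser would take over:

```perl
use strict;
use warnings;

# Hypothetical C source lines, used only for illustration.
my @source = (
    '#include <stdio.h>',
    '  #  include "my_header.h"',
    'int main(void) { return 0; }',
    '/* not a directive: #include <fake.h> */',
);

# Matches a directive that starts the line (optional whitespace around '#').
# It does not understand comments, macros, or line continuations.
my $include_re = qr/^\s*#\s*include\s*[<"]([^>"]+)[>"]/;

my @headers;
for my $line (@source) {
    push @headers, $1 if $line =~ $include_re;
}
print "$_\n" for @headers;
```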
|
as long as you don't define better as faster, yes, it is much better :D
Re: [Try-out] Regexp do's and don'ts
by exussum0 (Vicar) on Aug 15, 2004 at 14:37 UTC
|
Do give examples for any point you ever make when reading these types of documents. In some ways, you are giving advice, but if you give reasons with solid examples and counter examples, you make it that much stronger.
Points 1 and 2 are easy. Using /i is an efficiency thing. Show benchmarks. If the benchmarks show no difference, then the point isn't valid.
Point 7 ticked me off at a particular company, where a few people would do just that. Show a good example, maybe with a system call or file handle, that shows how this would work as an exploit.
It's the difference between "don't smoke" and "don't smoke, it increases your chances of cancer"
----
Then B.I. said, "Hov' remind yourself
nobody built like you, you designed yourself"
Re: [Try-out] Regexp do's and don'ts
by BrowserUk (Pope) on Aug 15, 2004 at 15:49 UTC
|
In my (somewhat devalued) opinion, there is only one DO... and one DONT... worthy of note.
- DON'T tell others what they should or should not do.
- DO explain to others why you choose to do (or not) certain things (in particular ways).
For bonus points, also explain why you (or others) might choose to use the proscribed behaviour (or not use the advised behaviour) under some circumstances. Place particular emphasis upon explaining the deciding factors that would sway your decision against your norm.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: [Try-out] Regexp do's and don'ts
by demerphq (Chancellor) on Mar 28, 2005 at 10:55 UTC
|
You should recommend that people avoid constructs like [Jj][Aa][Vv][Aa], as they are quite inefficient and can also blow out various optimizations just by their presence. It's better to write that as (?i:Java). Also, up until 5.9.2 perl doesn't optimise alternations very well, so it's advisable to use modules like Regexp::List to preprocess patterns such as /Lists|of|words/. OTOH, as of 5.9.2 perl _does_ optimize them, so using things like Regexp::List will only slow down your patterns (I'm hopeful that by 5.10 these modules will be updated to Do The Right Thing Regardless™).
In fact, from that version on it is recommended, if at all possible, that you use alternations instead of quantifiers and bracketing. I.e., /(cars|cart|carry|car)/ will be more efficient than /(car([st]|ry)?)/ as of 5.9.2, and in some circumstances massively more efficient.
I admit I wrote the optimization, so I'm tooting my own horn here a bit. :-) But it is worth realizing that alternations in later perls can be significantly faster than other hypothetically equivalent patterns.
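Before swapping one form for the other, it is worth checking that the two really capture the same thing on your data. A minimal sketch, assuming a made-up word list; it checks equivalence only and says nothing about speed, which you would measure separately (for instance with the core Benchmark module) on the perl version you actually run:

```perl
use strict;
use warnings;

# Words chosen for illustration; "carr" and "scar" probe partial matches.
my @words = qw(car cars cart carry carr scar bus);

my $alternation = qr/(cars|cart|carry|car)/;
my $factored    = qr/(car([st]|ry)?)/;

my $mismatches = 0;
for my $w (@words) {
    my ($alt) = $w =~ $alternation;   # first capture, or undef on failure
    my ($fac) = $w =~ $factored;
    $alt = '' unless defined $alt;
    $fac = '' unless defined $fac;
    $mismatches++ if $alt ne $fac;
    printf "%-6s alternation=%-6s factored=%-6s\n", $w, "'$alt'", "'$fac'";
}

# Same check for the case-insensitive forms.
my $classes = qr/[Jj][Aa][Vv][Aa]/;
my $flagged = qr/(?i:Java)/;
for my $name (qw(Java JAVA jAvA javascript Jave)) {
    my $c = $name =~ $classes ? 1 : 0;
    my $f = $name =~ $flagged ? 1 : 0;
    $mismatches++ if $c != $f;
}

print "mismatches: $mismatches\n";
```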
|
I am fully aware of the fact that m/[Jj][Aa][Vv][Aa]/ sucks like hell. I just needed a "complex" regex which had a clear goal, in order to demonstrate multi-lining regexes. But I'll add a note: don't try this at home :)
Re: [Try-out] Regexp do's and don'ts
by ikegami (Pope) on Sep 25, 2004 at 06:28 UTC
|
Re: [Try-out] Regexp do's and don'ts
by muba (Priest) on Aug 15, 2004 at 11:06 UTC
|
Re: [Try-out] Regexp do's and don'ts
by muba (Priest) on Aug 16, 2004 at 11:24 UTC
|
Original content (versions 0.2, 0.2.1) deleted and moved to the root node. The old version 0.1 has been deleted from the thread altogether. Sorry for the pollution.
|
Thanks MUBA,
I was actually searching for something on Perl's regular expressions. This node has been God (read "Perl") sent to me.
Regards
Sameet
Re: [Try-out] Regexp do's and don'ts
by nobull (Friar) on Mar 27, 2005 at 22:13 UTC
|
Good document. I can hardly fault it technically.
A few suggestions (I am a native speaker of (British) English):
"look how cute, he's telling the obvious."
"look how cute, he's stating the obvious."
to search global
to search globally
The plural of 'regex' is sometimes written as 'regexen'. This is not a standard way of making a plural in English, but it is still seen quite often.
The wisdom of using strictures and taint mode has little to do with regexes. The whole section "Rules of Thumb" should be introduced as general Perl programming advice not RE specific advice. Perhaps even split this section off to a separate node.
tainted date
tainted data
they are/behave malicious
Can't make this work comfortably in English. They are malicious or they behave maliciously.
Don't trust users.
...or programs under their control.
getting president
standing for president
Check if you enter the right airplane before entering it.
Check if you are entering the right airplane before you enter it.
if both are not present.
Sorry the precedence of 'not' here is ambiguous in written English. (In spoken English it would be possible to disambiguate with intonation).
- if neither are present.
- unless both are present.
- if both are absent.
($untainted) = $tainted =~ m/(.*)/g;
($untainted) = $tainted =~ m/(.*)/;
(And you just finished warning people not to use redundant qualifiers) :-)
This way, you only show you don't know why one would use taint mode and you make taint mode useles for your script.
This could be considered insulting. There's nothing wrong with unconditionally untainting data that is known for certain to come from a trusted source.
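To illustrate why the /g is redundant there, and what a stricter untainting pattern might look like for data that is not from a trusted source, here is a small sketch. The helper name untaint_filename and its allowed character set are assumptions made for this example; taint checking itself only takes effect when perl runs with -T:

```perl
use strict;
use warnings;

# In list context, /g returns every match. (.*) first matches the whole
# string and then matches once more, as an empty string, at the very end.
my @with_g    = "abc" =~ m/(.*)/g;   # two elements: "abc" and ""
my @without_g = "abc" =~ m/(.*)/;    # one element:  "abc"

# Either way the first element is the same, so /g adds nothing here.
print scalar(@with_g), " vs ", scalar(@without_g), " captures\n";

# A stricter untainting pattern: accept only a plain file name.
# The allowed character set is an assumption for this sketch.
sub untaint_filename {
    my ($tainted) = @_;
    my ($clean) = $tainted =~ m/\A([\w.-]+)\z/;
    return $clean;   # undef when the input is rejected
}

for my $candidate ('report_2004.txt', '../etc/passwd') {
    my $clean = untaint_filename($candidate);
    print "$candidate => ", (defined $clean ? $clean : 'rejected'), "\n";
}
```

The character class is deliberately narrow, but it is still only a sketch: it would accept a bare ".." for example, so a real check needs to reflect what the data is used for.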
|
Good document. I can hardly fault it technically.
A few suggestions (I am a native speaker of (British) English):
Alright, thank you! I edited the Original Post in order to use most of your suggestions.
($untainted) = $tainted =~ m/(.*)/g;
($untainted) = $tainted =~ m/(.*)/;
(And you just finished warning people not to use redundant qualifiers) :-)
Oh, I had a really hard time finding out what the difference between the two lines of code is. But indeed. I altered it, so now it is ($untainted) = $tainted =~ m/(.*)/; :)
Well, thanks!
Re: [Try-out] Regexp do's and don'ts
by ww (Archbishop) on Sep 24, 2004 at 19:38 UTC
|
one reader's take: you have the makings of a very good regex article here...
my quibbles: the editorial matter (re strict, warnings, etc) before you get to regex issues might well be split off -- perhaps multiple splits, as others have suggested. Seems to me that the title would be annoyingly misleading, otherwise.
As for your use of English -- you have used some constructs that vary from the forms used by those whose first language is "American English" but almost none (see first list item below) of them obstruct understanding or present any serious obstacle to "ease of reading."
coupla' specifics: (updates, 20050328 and new language suggestions below (waaay! far down))
- Has been addressed:
para 4: "lined out equally" might be more colloquially phrased "aligned vertically." (problem is that "lined out" is synonymous with "struck out.")
- Suggest you expand para 8 with some examples and explanations. Advising one to "RTFM" is sometimes all well and good, but in an article with a tutorial intent, it seems to me to verge on rudeness to the reader (and lest that be a mystery: because (1) it's a contraction for a phrase which uses a word offensive to some and (2, more important) TFM is extremely dense and sometimes, depending on the reader's learning style, difficult to absorb). You might consider offering links to several choices of explanatory material, including but not limited to Owl, Friedl, TFM and others. (Yes, I'm one such and would love to find the "other" that sings to me.)
- repeated below
"syntaxis" ??? Unfamiliar, not found in a quick dictionary check. Suspect you intend "syntax."
- Suggested addition: brief discussion of the ephemeral character of $1 ... (ie, reset issues) $1 is mentioned in para 6.
Hope this is some help. Please, drive on with your good work! ++
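On the $1 point, a minimal sketch of the reset issue, with made-up input strings: a failed match leaves $1 holding whatever the last successful match captured, so reading it without checking the match result can hand you stale data.

```perl
use strict;
use warnings;

my $first  = "id=42";
my $second = "no id here";

$first =~ /id=(\d+)/;      # succeeds and sets $1
my $from_first = $1;

$second =~ /id=(\d+)/;     # fails, so $1 is left untouched
my $stale = $1;            # still the value captured from $first

# Safer: take the captures from the match's return value.
# A failed match in list context returns an empty list.
my ($safe) = $second =~ /id=(\d+)/;

print "first: $from_first, stale: $stale, safe: ",
      (defined $safe ? $safe : 'undef'), "\n";
```

Testing the match itself (if ($second =~ /.../) { ... use $1 ... }) avoids the problem just as well.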
Re: [Try-out] Regexp do's and don'ts
by muba (Priest) on Aug 15, 2004 at 21:42 UTC
|
Re: [Try-out] Regexp do's and don'ts
by ww (Archbishop) on Mar 28, 2005 at 13:58 UTC
|
muba: good stuff.
You may wish to consider the following (mostly minor and occasionally open to debate) re the idiom or syntax:
In Introduction, "Note: this is not a regex tutorial or regex howto." (emphasis supplied) s/or /nor/
likewise, s/If you may ever find /If you ever find / (may)
Jargon: "Before I finnaly start off, let's set some terminology." -- for spelling change to "finally"; for idiom: just omit it entirely.
Rules of Thumb 2: I'm intruding into content here, but I'm troubled by the statement, "when input from external sources may be unsafe." My view: input from external sources is ALWAYS unsafe... even if it's coming from me. No malice is required: "Fat fingeritis" can wreak havoc!
RoT 2: "...etc) is considered 'tainted'." s/is/are/ for subject-verb agreement in quantity; typo: "Also, thinks like" s/thinks/things/.
also in RoT 2, for brevity: "There are several ways to untaint data, which I am not about to mention here. You should check the above mentioned Perl Security (perlsec) manpage." could be written, "There are several ways to untaint data for which you should check the above-mentioned...."
RoT3: "They are ignorant or else they are malicious." would be less globally applicable to (all) users if you said, "Some are ignorant; some are malicious." (As written, the current phrase indicts ALL users.) and
"...number from 1 to 5, including, you..." s/including/inclusive/
"On the other hand," means (in this usage) that what follows is intended as a counter-example, whereas what actually follows is a supporting or additional example. One way to improve it would be to drop the quoted phrase or (and the grammar stiffs will object to this) replace "OTOH" with "Or".
typo: s/easiliy/easily/
RoT 5. "syntaxis" -- I think you want "syntax" and
"analysis" instead of "analyzing."
RoT 6. spelling: s/shuld/should/
RoT 7. "Do use CGI; (have..." might be clearer if you were to say "Do use CGI:; (have..." or, even better, if you specified the module by its full name
and: "CGI offers you a great amount of functions" can be better phrased "CGI offers you many functions"
and in "...as good as the module's author." s/good/well/ (good is an adjective, well is the adverb form).
If you find these useful (msg me), I'll carry on with the rest of the document.
Again, ++
|
Thank you!
I used all but one of your suggestions to improve this document. Most things you pointed out were stupid mistakes (which I of course fixed); others were things I'd never have found out myself.
Again, thank you!
Regexp Legibility
by patrickhaller (Initiate) on Jun 14, 2009 at 05:21 UTC
|
I used to (before use strict) compose regexps by hiding the components inside an if, e.g.:
# qr// builds regex objects; a bare /.../ would match against $_ instead
$complex_re = qr/^($ip) ($host) ($msg)$/ if (
    $ip = qr/\d+\.\d+\.\d+\.\d+/,
    $host = qr/[\-\.\w]+/,
    $msg = qr/.*/
);
Nowadays, I use strict and eval, e.g.
my $complex_re = eval {
    my $ip = qr/\d+\.\d+\.\d+\.\d+/;
    my $host = qr/[\-\.\w]+/;
    my $msg = qr/.*/;
    return qr/^($ip) ($host) ($msg)$/;   # qr//, so a compiled regex is returned
};
We pay about a 5% penalty for the eval when we use this in a tight loop; however, we can avoid that by moving the regexp creation outside the loop.
                 Rate    with eval without eval
with eval    116279/s           --          -5%
without eval 121951/s           5%           --
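For what it's worth, the same composition can be sketched with a do block, building the regexp once before the loop so the construction cost is not paid per iteration. The log lines below are invented for illustration, and do is used here only as one possible alternative to eval:

```perl
use strict;
use warnings;

# Build the composite regexp once; the component variables stay
# private to the do block.
my $complex_re = do {
    my $ip   = qr/\d+\.\d+\.\d+\.\d+/;
    my $host = qr/[-.\w]+/;
    my $msg  = qr/.*/;
    qr/^($ip) ($host) ($msg)$/;
};

# Hypothetical log lines, for illustration only.
my @log = (
    '10.0.0.1 web01 disk full',
    'garbage line without an address',
);

my @parsed;
for my $line (@log) {
    # List-context match returns the captures, or an empty list on failure.
    if (my ($ip, $host, $msg) = $line =~ $complex_re) {
        push @parsed, { ip => $ip, host => $host, msg => $msg };
        print "ip=$ip host=$host msg=$msg\n";
    }
}
```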
Patrick
|
Looks like eval runs faster...
            Rate as usual  with do
as usual 63091/s       --     -29%
with do  88496/s      40%       --
             Rate with eval as usual
with eval 59773/s        --      -9%
as usual  65359/s        9%       --