This looks like someone sneezed and hit the keyboard

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: This looks like someone sneezed and hit the keyboard by Roger (Parson) on Feb 03, 2004 at 07:05 UTC
`use YAPE::Regex::Explain; $regex = qr/.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/; print YAPE::Regex::Explain->new($regex)->explain;`
Let me show you my (colourful) command prompt... :-) which looks like: ($hostname)$fullpath>
export PS1="\[\033[1;37m\](\[\033[1;32m\]`uname -n`\[\033[1;37m\])\[\033[1;36m\]\$PWD\[\033[1;37m\]>\[\033[0m\] "
# ANSI colour commands
# \[\033[1;37m\] => set colour to white (37)
# \[\033[1;32m\] => set colour to green (32)
# \[\033[1;36m\] => set colour to cyan (36)
# ...
# \[\033[0m\] => set colour back to normal
Re: Re: This looks like someone sneezed and hit the keyboard by flyingmoose (Priest) on Feb 03, 2004 at 14:09 UTC
In one of my first projects using Perl (I had an internship at the time), someone wanted me to parse the output of top and another Unix tool. Both used ANSI codes, so I backed away. It looks like this guy didn't back away, so I'll give him a gold star for bravery. However, he loses his gold star for not commenting his code and for using an ugly regex without the /x modifier. It's really amazing how many people (and in good open-source programs, too) forget to add line comments here and there when they could greatly help. I am not asking for flower-box style comments, just an occasional "now we parse the ANSI terminal prompt" kind of comment here and there. Long story short, people who never comment their code and module implementations need to be shot :)
Re: Re: Re: This looks like someone sneezed and hit the keyboard by ysth (Canon) on Feb 03, 2004 at 17:05 UTC
I have real trouble remembering to use whitespace, comments, and /x in my regexes. I just don't have the habit (yet!) while writing code. Almost all the /x's that end up in my code are added after the fact. I'm almost ready to decide just to put an /x on all regexes (to help develop the habit), but I know that will get me strange looks.
Re: Re: Re: Re: This looks like someone sneezed and hit the keyboard by flyingmoose (Priest) on Feb 03, 2004 at 19:27 UTC
Re: Re: This looks like someone sneezed and hit the keyboard by spartan (Pilgrim) on Feb 03, 2004 at 17:31 UTC
Wow. I'd spend all my votes to upvote this for a week if I could. I've always thought of regexes as a sort of black art (and still do to a certain degree), and I've always wanted something that would just explain in plain English what the heck a regex means when you read it. This could be my ticket (and possibly that of MANY others as well) to finally get a grip on regexes. Very funny Scotty... Now PLEASE beam down my PANTS!
Re: Re: Re: This looks like someone sneezed and hit the keyboard by hessef (Monk) on Feb 03, 2004 at 18:01 UTC
Have you considered reading the O'Reilly Press book "Mastering Regular Expressions" by Jeffrey E. F. Friedl? The first few chapters explain in detail how to read regexes, step by step.
Re:This looks like someone sneezed and hit the keyboard by Anonymous Monk on Feb 05, 2004 at 09:00 UTC
Hi, There is an article on perl.com that might be useful: http://www.perl.com/pub/a/2004/01/16/regexps.html Regexes are programs. It takes time to learn a new language, and that's why it might look difficult at the beginning. With time and practice that kind of regex becomes (almost) clear. Using /x and commenting is very important, but having the right support from the tools you use is also important. Here is a little HTML document that shows your regex colored. I couldn't get it to show directly in this answer, so you'll have to copy and paste it :-(:
<HTML> <HEAD> <TITLE>Smed generated dump</TITLE> </head> <body bgcolor="#FFFFFF"> <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT> <FONT color=#f00000 style="BACKGROUND-COLOR: #ffffff"> / </FONT> <FONT color=#ffff00 style="BACKGROUND-COLOR: #ff0000"> .* </FONT> <FONT color=#ffffff style="BACKGROUND-COLOR: #ff0000"> ( </FONT> <FONT color=#ffff00 style="BACKGROUND-COLOR: #643296"> [\$#\%>~] </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #00ff00"> | </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \@ </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #afeeee"> \w </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> ~ </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \$ </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #00ff00"> | </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> e </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> 0 </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> m </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \\ </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \] </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> </FONT> <FONT color=#ff0000 style="BACKGROUND-COLOR: #ffff00"> \[ </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> 0 </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #f0f0ff"> m </FONT> <FONT color=#ffffff style="BACKGROUND-COLOR: #ff0000"> ) </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #afeeee"> \s </FONT> <FONT color=#f00000 style="BACKGROUND-COLOR: #f0f0ff"> ? </FONT> <FONT color=#f00000 style="BACKGROUND-COLOR: #ffffff"> / </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT> <FONT color=#000000 style="BACKGROUND-COLOR: #ffffff"> <br> </FONT> </body> </HTML>
If your text editor supported this, you would have fewer problems getting into regexes. There are a few tools for working on regexes, and they do the coloring as well. Cheers, Nadim (NKH).
Re: Re: This looks like someone sneezed and hit the keyboard by Anonymous Monk on Feb 09, 2004 at 23:39 UTC
"my (colourful) command prompt... :-)" Or, simpler, with \h instead of `uname -n`: `export PS1="\[\033[1;37m\](\[\033[1;32m\]\u@\h\[\033[1;37m\])\[\033[1;36m\]\$PWD\[\033[1;37m\]>\[\033[0m\] "` which shows (username@host)path>
Re: This looks like someone sneezed and hit the keyboard by davido (Cardinal) on Feb 03, 2004 at 07:06 UTC
I'm going to re-enter your RE as though it has the /x modifier so that it's easier to comment on what it's doing... here goes:
`/
  .*                        # Match any quantity of any character (or
                            # none at all)
  (                         # group together, and capture.
    [\$#\%>~]               # match any one of the following: $#%>~
    |                       # OR
    \@\w~\$                 # match literal @, a word character, ~, and $
    |                       # OR
    \\\[\\e\[0m\\\] \[0m    # match "\[\e[0m\] [0m"
  )                         # end capturing / grouping.
  \s?                       # match a single optional whitespace
/x                          # End regexp.`
So you put that all together and you get a regexp that will match a pretty weird looking string. The following strings should match (and MANY others too): "`Hi, I'm Dave\[\e[0m\] [0m`" "`121#@$14asdf$`" "`@h~$`" Looks pretty peculiar to me. Dave
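For anyone who wants to check those example strings rather than take them on faith, here is a minimal sketch. It assumes the regex is applied without /x, exactly as first quoted in this thread; the helper name `captured` and the extra non-matching sample are made up for illustration.

```perl
use strict;
use warnings;

# The regex under discussion, written without /x so the literal space counts.
my $re = qr/.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/;

# Hypothetical helper: return what ends up in $1, or undef when there is no match.
sub captured {
    my ($text) = @_;
    return $text =~ $re ? $1 : undef;
}

# Single-quoted and q{} strings keep the backslashes literal.
for my $text (q{Hi, I'm Dave\[\e[0m\] [0m}, '121#@$14asdf$', '@h~$', 'no prompt here') {
    my $got = captured($text);
    printf "%-28s => %s\n", $text, defined $got ? "captured '$got'" : 'no match';
}
```

Because the leading .* is greedy, the capture is whichever alternative matches furthest to the right, so for `@h~$` the group holds just the final `$` rather than the whole four-character string.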
Re: Re: This looks like someone sneezed and hit the keyboard by Theo (Priest) on Feb 03, 2004 at 15:41 UTC
Okay, I guess this is a newbie question ... It looks to me like the `/.*` opening to the regex is greedy and would grab everything that was applied to it, leaving nothing for the rest of the regex to match. In other words, anything/everything would give a match. Other, wiser monks have not mentioned this, so I'm assuming I've missed something. Why isn't my assumption true? Update: Thanks to bunnyman, ysth and MCS for their gentle instruction. -Theo- (so many nodes and so little time ... )
Re: Re: Re: This looks like someone sneezed and hit the keyboard by bunnyman (Hermit) on Feb 03, 2004 at 16:33 UTC
No, everything in the regex must match, not just the first part of it, and the part in the middle with the (one|two|three) must match too. The thing that you must remember is that regexes can backtrack -- if they get to the end of the string without having matched yet, they can go back a few letters and try again. So the .* part will first try to match the entire string, because it is greedy. Then the middle part (one|two|three) must match, but there is nothing left in the string, and we must backtrack and try again. First we try going one letter back, then two, and eventually we either find the match or we backtrack all the way to the start and then there is no match.
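The backtracking described above can be watched from the outside with a small sketch. The sample strings and the helper name `split_at_alternative` are invented for this illustration; the pattern is the (one|two|three) placeholder from the post, not the original regex.

```perl
use strict;
use warnings;

# Hypothetical helper: returns what .* kept and which alternative matched,
# or an empty list when backtracking runs out of positions to try.
sub split_at_alternative {
    my ($text) = @_;
    return $text =~ /(.*)(one|two|three)/;
}

for my $text ('one fish two fish', 'red fish blue fish') {
    my ($prefix, $alt) = split_at_alternative($text);
    if (defined $alt) {
        print "'$text': .* kept '$prefix', alternative matched '$alt'\n";
    }
    else {
        print "'$text': no match after backtracking to the start\n";
    }
}
```

In the first string .* gives back characters from the end until "two" fits, so it keeps "one fish " and never reaches the earlier "one". The second string contains none of the alternatives, so the match fails.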
Re: Re: Re: This looks like someone sneezed and hit the keyboard by MCS (Monk) on Feb 03, 2004 at 17:05 UTC
The reason that most people say you shouldn't use .* is because it can match nothing (or everything), so matching just .* is pointless because it will match everything (including nothing). However, if you were looking for "hi", some amount of text, and then "there", you could use: `$line =~ /hi.*there/;` and it would match. Of course it's greedy and might not be exactly what you wanted, but there are times when it is needed. However, it is overused a lot and usually something better can be used. To answer your question though, /.* doesn't grab everything because it has required stuff after that. If you try to match /.*some text/, it has to find "some text" or it will fail. However, if you try to match something like /.*\d?/, it could match nothing since the \d is optional.
Re: Re: Re: This looks like someone sneezed and hit the keyboard by ysth (Canon) on Feb 03, 2004 at 16:59 UTC
Because for the match to succeed, one of the three (optA|optB|optC) options has to match. With the .* at the front, it will basically start at the end of the string and work backwards until it finds one of the alternates. The \s? at the end is useless though (unless $& is used).
Re: This looks like someone sneezed and hit the keyboard by dws (Chancellor) on Feb 03, 2004 at 07:23 UTC
One way to make sense of your former employee's code is to use the /x modifier on the regex, which lets you throw in whitespace for formatting without screwing up semantics (except that you'll need to encode whitespace as \s -- thanks, grinder). With this, the regex will look something like `m/ .* ( [\$#\%>~] | \@\w~\$ | \\ \[ \\ e \[ 0m \\ \] \s \[ 0m ) \s? /x` which basically says: skip past as many characters as possible, then match one of three things: (1) a single character that is one of $ # % > ~; (2) a four-character string beginning with @, followed by a single "word" character, followed by ~ and $; or (3) the string "\[\e[0m\] [0m". Then allow for an optional trailing space, with the "one of these three things" going into the Perl variable `$1`. In short, this regular expression isn't matching what you think it's supposed to be matching. The third alternative looks like it's intended to capture an escape sequence of some sort. So yeah, it looks pretty messy.
Re:x2 This looks like someone sneezed and hit the keyboard (whitespace semantics in /x REs) by grinder (Bishop) on Feb 03, 2004 at 08:28 UTC
"use the /x modifier on the regex, which lets you throw in whitespace for formatting without screwing up semantics" ... with one significant caveat: spaces lose their semantics. The RE `/foo bar/` is not the same as `/foo bar/x`. The latter is equivalent to `/foobar/`. And there is a space in the OP's RE, which will therefore be incorrect if /x is blindly applied. The choices are either to escape the space with a backslash (which is difficult to read, especially if the backslash-space winds up at the end of a line) or replace it with \s, which is not semantically equivalent (it can match tabs or newlines as well). There is always `[\ ]` but I'm not sure it's a win.
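A quick way to see the caveat is to run each spelling of "foo, one space, bar" against a few sample strings. This is only a sketch: the labels, the sample strings, and the helper name `matched_samples` are assumptions made for the demonstration.

```perl
use strict;
use warnings;

# Candidate ways of writing "foo, one space, bar"; labels are for display only.
my @patterns = (
    [ 'qr/foo bar/'     => qr/foo bar/     ],
    [ 'qr/foo bar/x'    => qr/foo bar/x    ],
    [ 'qr/foo\ bar/x'   => qr/foo\ bar/x   ],
    [ 'qr/foo[\ ]bar/x' => qr/foo[\ ]bar/x ],
    [ 'qr/foo\sbar/x'   => qr/foo\sbar/x   ],
);

# Hypothetical helper: which of the sample strings does a pattern match?
sub matched_samples {
    my ($re) = @_;
    return grep { $_ =~ $re } 'foo bar', 'foobar', "foo\tbar";
}

for my $entry (@patterns) {
    my ($label, $re) = @$entry;
    my @hits = matched_samples($re);
    printf "%-17s matches: %s\n", $label, join(', ', map { "'$_'" } @hits);
}
```

Only the plain /x version matches "foobar", and only the \s version also accepts the tab-separated string, which is the semantic difference mentioned above.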
Re: Re: This looks like someone sneezed and hit the keyboard by MCS (Monk) on Feb 03, 2004 at 16:54 UTC
I would bet the original author probably meant to group the alternatives together, and in that case it could be made a little faster by changing the first ( to (?: (which makes it not save the match into $1). Unless of course there is a $1 soon after the regex... in that case it was probably meant to capture it.
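To make the difference concrete, here is a small sketch comparing the two group styles. The sample prompt string is invented, and nothing here measures the speed claim; it only shows what each form hands back.

```perl
use strict;
use warnings;

# An invented sample prompt; single quotes keep @ and $ literal.
my $prompt = 'user@host:~$ ';

# Capturing group: the matched alternative is returned (and stored in $1).
my @with_capture = $prompt =~ /.*([\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/;

# Non-capturing group: same match, but nothing is saved, so a successful
# match in list context just returns a true value.
my @without_capture = $prompt =~ /.*(?:[\$#\%>~]|\@\w~\$|\\\[\\e\[0m\\\] \[0m)\s?/;

print "capturing:     (@with_capture)\n";
print "non-capturing: (@without_capture)\n";
```

If the surrounding code reads $1 afterwards, only the first form gives it anything to read.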
Re: This looks like someone sneezed and hit the keyboard by Sol-Invictus (Scribe) on Feb 03, 2004 at 10:54 UTC
This is expecting to parse a command line prompt which uses ANSI escapes:
ANSI Color Codes in brief:
0 to restore default color
1 for brighter colors
4 for underlined text
5 for flashing text
30 for black foreground
31 for red foreground
32 for green foreground
33 for yellow (or brown) foreground
34 for blue foreground
35 for purple foreground
36 for cyan foreground
37 for white (or gray) foreground
40 for black background
41 for red background
42 for green background
43 for yellow (or brown) background
44 for blue background
45 for purple background
46 for cyan background
47 for white (or gray) background
You use the above codes together with an escape sequence like this (replace the '#' with the colour code of your choice): \e[#m
Once you've used an escape, all subsequent text will be affected until you use the reset escape \e[0m, so if I want to format a part of a line of text, instead of:
print "This boring old line of text was supposed to have red text\n";
do this:
print "This new improved, brighter, more interesting line of text has \e[31mred text\e[0m\n" ;
If you want to use two escapes on the same piece of text, separate them with a ';' : \e[#;#m
print "This \e[5m new improved \e[0m, \e[5m brighter \e[0m, more \e[5minteresting \e[0m line of text has \e[31;5mflashing red text\e[0m\n" ;
`\e[0m` (the reset string) requires escaping in the regex, or at least the '\' and '[' do. In pseudo code the regex would read like this: `match anything (.*) grab the rest up to \e[0m` Try running it on a couple of the ANSI formatted strings I gave as examples and you'll see how it's working.
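The escape format above is regular enough to wrap in a couple of small helpers. This is only a sketch: `colour` and `strip_colour` are hypothetical names, not functions from any module, and the stripping regex handles only the \e[...m colour sequences described in this post.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap $text in an ANSI escape built from one or more
# of the codes listed above, then reset to the default with \e[0m.
sub colour {
    my ($text, @codes) = @_;
    return "\e[" . join(';', @codes) . 'm' . $text . "\e[0m";
}

# Hypothetical helper: remove colour escapes of the form \e[...m. Inside the
# regex \e stands for the ESC character, and the '[' has to be escaped.
sub strip_colour {
    my ($text) = @_;
    $text =~ s/\e\[[0-9;]*m//g;
    return $text;
}

my $line = 'This line has ' . colour('flashing red text', 31, 5) . "\n";
print $line;                  # coloured on an ANSI-capable terminal
print strip_colour($line);    # the same text with the escapes removed
```

Stripping the escapes first is one way to sidestep matching them at all when parsing tool output, assuming the colours themselves carry no information you need.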
Re: This looks like someone sneezed and hit the keyboard by MCS (Monk) on Feb 03, 2004 at 16:51 UTC
While it can be very intimidating because you don't know what it means, it's really not that hard to read. I would suggest the book "Mastering Regular Expressions" by Jeffrey Friedl. It helped me make sense of what was once gibberish. Others have already explained what it does so I won't go over that again, but I don't think regular expressions should be feared. A read through "Mastering Regular Expressions" should make a master out of anyone. I also want to say that I disagree with the notion that you should always use /x to make your code clearer. If you do, you are relying on what the comments say, not on what the regex says. To me, when you spread it out like that, it makes it easier to comment but harder to actually read and find errors. (In my opinion) I think it's all the whitespace around the regex that makes it harder for me to understand. I agree most code is under-commented, but /x tends to lead to overcommenting by people who don't really understand regexes. P.S. I have no affiliation with the author or the publishers other than that I bought the book and loved it.

