Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Weird Regex in CGI::Cookie

by enoch (Chaplain)
on Oct 15, 2003 at 23:21 UTC ( #299587=perlquestion: print w/ replies, xml ) Need Help??
enoch has asked for the wisdom of the Perl Monks concerning the following question:

I stumbled across a bit of code with the following loop in it:
foreach (@pairs) { s/\s*(.*?)\s*/$1/; . . . }
That regex sent up a warning signal in my brain.
  • Match 0 or more spaces greedily, followed by
  • Capture into $1, any character (other than a new line) 0 or more times non-greedily, followed by
  • Match 0 or more spaces greedily
  • Replace everything matched with what was caught in $1
Looking at it, I assumed the author was trying, albeit in a broken way, to strip out leading and trailing spaces. To further investigate its broken-ness, I wrote a small snippet:
push @strings, 'nospace'; push @strings, 'trailingspace '; push @strings, ' leadingspace'; push @strings, 'internal space'; push @strings, ' surroundedbyspace '; push @strings, ' spaces every where '; for my $wtf (@strings) { print "Before: '$wtf'\n"; $wtf =~ s/\s*(.*?)\s*/$1/; print "After: '$wtf'\n\n"; }
Which, output:
Before: 'nospace' After: 'nospace' Before: 'trailingspace ' After: 'trailingspace ' Before: ' leadingspace' After: 'leadingspace' Before: 'internal space' After: 'internal space' Before: ' surroundedbyspace ' After: 'surroundedbyspace ' Before: ' spaces every where ' After: 'spaces every where '
Which is, basically, what I expected. The regex is only replacing leading spaces... The things is, this code is located in CGI::Cookie in the raw_fetch subroutine (view the source here). Here is the code in the subroutine:
# Fetch a list of cookies from the environment or the incoming headers + and # return as a hash. The cookie values are not unescaped or altered in +any way. sub raw_fetch { my $class = shift; my $raw_cookie = get_raw_cookie(@_) or return; my %results; my($key,$value); my(@pairs) = split("; ?",$raw_cookie); foreach (@pairs) { s/\s*(.*?)\s*/$1/; if (/^([^=]+)=(.*)/) { $key = $1; $value = $2; } else { $key = $_; $value = ''; } $results{$key} = $value; } return \%results unless wantarray; return %results; }
So, is this regex doing something that I am missing, or is it a broken regex that was placed in a seldom-used subroutine that no one has bother correcting?
Can anyone shed some light...

enoch

update: Fixed a typo.

Comment on Weird Regex in CGI::Cookie
Select or Download Code
Re: Weird Regex in CGI::Cookie
by Abigail-II (Bishop) on Oct 15, 2003 at 23:40 UTC
    It looks very broken to me. It's also strange that the code is deleting leading whitespace, while the split uses "; ?".

    Abigail

Re: Weird Regex in CGI::Cookie
by davido (Archbishop) on Oct 16, 2003 at 00:33 UTC
    I just verified that CGI::Cookie v1.24 still contains that ineffectual regexp. Per the CGI::Cookie POD, "Address bug reports and comments to: lstein@cshl.org". Go for it. You might want to submit your report along with a recommended patch, if you want to get it implemented quicker.

    I know that Lincoln is pretty aggressive about stamping bugs out of his modules.

    The problem you're seeing is that the regexp engine is acting lazy. The following regexp seems to work for your test strings:

    s/\s*\b(.+)\b\s*/$1/

    But it's probably better to anchor the regexp to the start and end of string, to draw it out to match the entire string. Even then it still just is a slightly ambiguous way to do things; to rely on greed outweighing non-greed. Especially when there are better ways to do it.

    It is faster and more reliable to do it this way:

    s/^\s+//; s/\s+$//;

    Alternation could be used along with the /g modifier, but then you slow things down just a little.


    Dave


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein
      You might want to submit your report along with a recommended patch, if you want to get it implemented quicker.

      I second that, whole-heartedly. In fact, here's my suggestion for a better version (hoping I understand the intent correctly):

      sub raw_fetch { my $class = shift; my $raw_cookie = get_raw_cookie(@_) or return; my %results; $raw_cookie =~ s/^\s+//; $raw_cookie =~ s/\s+$//; my(@pairs) = split(/\s*;\s*/,$raw_cookie); foreach (@pairs) { my ($key,$val) = (/^([^=]+)=?(.*)/); $results{$key} = $val; } return \%results unless wantarray; return %results; }
      I've only tested it to the extent that the assignments to %results match the original code (except for the trailing whitespace glitch) -- that is, a "raw_cookie" string like this:
      one=1 ;two=2; three=the third ; four=  ; five  ; six = the sixth ;
      
      produces these key/value pairs:
      'one'=>'1',
      'two'=>'2',
      'three'=>'the third',
      'four'=>'',
      'five'=>'',
      'six '=>' the sixth'
      
      (note that spaces next to "=" are not deleted)
Re: Weird Regex in CGI::Cookie
by d_i_r_t_y (Monk) on Oct 16, 2003 at 00:59 UTC

    yes, it is broken. the reason is that the pattern is only matching the leading space and nothing else -- ie:

    $_ = ' blee blah '; s/\s*(.*?)\s*/$1/; print "matched '$&'; retained '$1'; str is '$_'\n";
    which gives matched '       '; retained ''; str is 'blee blah    '

    to strip whitespace, the minimal match needs to be forcibly anchored to match to the end of the string, ie:  s/^\s*(.+?)\s*$/$1/, which now successfully strips leading & training whitespace.

    however this is a poor way to do it. it is clearer and faster to use 2 regexps, one to strip leading \s, the other the trailing \s; ie:

    s/^\s+//; s/\s+$//;

    using 2 regexps is also quite a bit faster than the single regexp too:

    [matt@blade8 matt]$ perl -MBenchmark -e '@list1 = @list2 = map { " "x +rand(100) . "some text" . " "x rand(100) } 1 .. 100; timethese( 100_ +000, { single_regexp => sub { return map { s/^\s*(.+?)\s*$/$1/; $_ } + @list1 }, dual_regexp => sub { return map { s/^\s+//; s/\s+$//; $_ } + @list2 } } ); ' Benchmark: timing 100000 iterations of single_regexp, dual_regexp... single_regexp: 56 wallclock secs (56.04 usr + 0.00 sys = 56.04 CPU +) @ 1784.44/s (n=100000) dual_regexp: 17 wallclock secs (18.63 usr + 0.00 sys = 18.63 CPU) +@ 5367.69/s (n=100000)

Re: Weird Regex in CGI::Cookie
by enoch (Chaplain) on Oct 16, 2003 at 04:39 UTC
    I emailed in the following patch:
    --- /usr/local/lib/perl5/5.6.1/CGI/Cookie.pm +++ /usr/local/lib/perl5/5.6.1/CGI/Cookie.pm.new @@ -42,7 +42,7 @@ my(@pairs) = split("; ?",$raw_cookie); foreach (@pairs) { - s/\s*(.*?)\s*/$1/; + s/^\s+//; s/\s+$//; if (/^([^=]+)=(.*)/) { $key = $1; $value = $2;
Re: Weird Regex in CGI::Cookie
by hossman (Prior) on Oct 20, 2003 at 05:06 UTC
    So, is this regex doing something that I am missing, or is it a broken regex that was placed in a seldom-used subroutine that no one has bother correcting? Can anyone shed some light...

    I have a feeling that it is broken in the sense that it doesn't do what (i/you/others suspect) it was intended to do.

    However.... based on the cookie header format this code seems pretty unneccessary. It was probably added as a "sanity check" in case some external client/server was over-zealous with it's white space.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://299587]
Approved by chromatic
Front-paged by davido
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others drinking their drinks and smoking their pipes about the Monastery: (11)
As of 2014-08-29 18:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The best computer themed movie is:











    Results (286 votes), past polls