PerlMonks  

regex is not working as I intended

by fireblood (Acolyte)
on Jan 18, 2018 at 22:20 UTC (#1207486=perlquestion)

fireblood has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks and fellow followers,

I am unable to understand why my regex is not working the way I intended. I'm trying to write a regex that recognizes when a parameter value is enclosed in single or double quotes and pulls out the value that lies between the quotes or, when the value is not quoted, simply returns the value. Here is my code:

use strict;

my $regex = qr /          # start of regex
    (                     # start of capturing alternation
        (?<=\')           # positive lookbehind to a single quote
        (                 # start of capture buffer 2
            .*?           # the value between the single quotes
        )                 # end of capture buffer 2
        (?>\')            # positive lookahead to a single quote
    |                     # or
        (?<=\")           # positive lookbehind to a double quote
        (                 # start of capture buffer 3
            .*?           # the value between the double quotes
        )                 # end of capture buffer 3
        (?>\")            # positive lookahead to a double quote
    |                     # or
        (                 # start of capture buffer 4
            .*            # any unquoted value
        )                 # end of capture buffer 4
    )                     # end of capturing alternation
/x                        # end of regex
;

&do_test (\"Now is the time");      # test with unquoted value
&do_test (\"'Now is the time'");    # test with single quoted value
&do_test (\'"Now is the time"');    # test with double quoted value

sub do_test {
    print "\n";
    if (${$_[0]} =~ /$regex/) {
        print "\$1 is $1.\n";
        print "\$2 is $2.\n";
        print "\$3 is $3.\n";
        print "\$4 is $4.\n";
    }
    else {
        print "No match.\n";
    }
}

When I run this, I get:

$1 is Now is the time.
$2 is .
$3 is .
$4 is Now is the time.

$1 is 'Now is the time'.
$2 is .
$3 is .
$4 is 'Now is the time'.

$1 is "Now is the time".
$2 is .
$3 is .
$4 is "Now is the time".

Why are the first two alternatives not capturing quoted test strings?

Replies are listed 'Best First'.
Re: regex is not working as I intended
by GrandFather (Sage) on Jan 18, 2018 at 23:11 UTC

    I'm not sure why, but the last alternate ((.*)) seems to win in all cases when the other alternates use a look behind. However, things are much easier to understand if you look around less:

    use strict;
    use warnings;

    my $regex = qr/(' ([^']*) ' | " ([^"]*) " | (.*))/x;

    do_test (qq~No quote~);
    do_test (qq~'Single quote'~);
    do_test (qq~"Double quote"~);

    sub do_test {
        my ($line) = @_;

        print "\n";
        if ($line =~ $regex) {
            print "\$1 is $1.\n" if defined $1;
            print "\$2 is $2.\n" if defined $2;
            print "\$3 is $3.\n" if defined $3;
            print "\$4 is $4.\n" if defined $4;
        }
        else {
            print "No match.\n";
        }
    }

    Prints:

    $1 is No quote.
    $4 is No quote.

    $1 is 'Single quote'.
    $2 is Single quote.

    $1 is "Double quote".
    $3 is Double quote.

    Note too the various other tidy-ups in the code, especially avoiding calling subs with & (which doesn't do what you think) and the excessive use of \.
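One way to see what the & sigil can change is a minimal sketch like the following. The sub names (count_args, via_ampersand, via_plain_call) are made up for illustration. It shows the best-known surprise: a call written with & and no parentheses hands the caller's current @_ to the callee. In the original post the calls do have parentheses, where, as far as perlsub describes it, the remaining effect of & is that prototypes are ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: reports how many arguments it received.
sub count_args { return scalar @_ }

# With "&" and no parentheses, the callee sees the caller's current @_.
sub via_ampersand { return &count_args }

# A plain call with empty parentheses passes an empty argument list.
sub via_plain_call { return count_args() }

my $shared = via_ampersand(1, 2, 3);
my $plain  = via_plain_call(1, 2, 3);

print "ampersand call saw $shared arguments\n";   # 3
print "plain call saw $plain arguments\n";        # 0
```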

    Premature optimization is the root of all job security

      I'm not sure why,

      At position 0,

      1. (?<=\')(.*?)(?>\') can't possibly match (there can't be a ' before the first character),
      2. (?<=\")(.*?)(?>\") can't possibly match (there can't be a " before the first character), but
      3. (.*) always matches.

      Since it matched at position 0, it doesn't try to match at position 1 (where one of the first two alternates has a chance of matching).
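The position argument above can be checked with a small sketch. It reuses the three alternatives from the original post, including the (?>...) groups, and a made-up helper, first_match, that starts the match at a given offset and reports where the match began and which capture buffer took part.

```perl
use strict;
use warnings;

# The alternatives from the original post, with the same buffer numbers.
my $regex = qr/ ( (?<=')(.*?)(?>') | (?<=")(.*?)(?>") | (.*) ) /x;

# Hypothetical helper: try to match starting exactly at $offset and
# return the start of the match plus the inner buffer (2, 3 or 4) used.
sub first_match {
    my ($string, $offset) = @_;
    pos($string) = $offset;
    return unless $string =~ /\G$regex/g;
    my ($buffer) = grep { defined $-[$_] } 2 .. 4;
    return ($-[0], $buffer);
}

for my $offset (0, 1) {
    my ($start, $buffer) = first_match(q{'Now is the time'}, $offset);
    print "offset $offset: match starts at $start, buffer $buffer\n";
}
# offset 0: match starts at 0, buffer 4
# offset 1: match starts at 1, buffer 2
```

At offset 0 only the (.*) alternative can succeed, so the engine never gets as far as offset 1, where the first alternative would have matched.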

        I forgot to mention that this regex is used repeatedly on the same string to parse out parameter and value pairs, such as the following:

        radius = 3, density = .014, URL = "https://www.geometry.org", max_no_of_attempts = 4

        That is why the lookbehind and lookahead assertions are used. Even though they cannot possibly match at the beginning or end of the overall parameter string, they can match at intermediate positions as the regex walks over the parameter string, parsing out individual parm and value pairs.
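For the repeated-use case described here, one possible approach (not the original poster's code) is to anchor each attempt with \G so that every match continues where the previous pair ended. The sketch below is a guess at the intended format: the pattern name $pair, the hash %value_of, and the assumption that unquoted values contain no commas or whitespace are all illustrative. The branch reset group (?|...) needs Perl 5.10 or later and lets all three alternatives store the value in the second buffer.

```perl
use strict;
use warnings;

my $params = q{radius = 3, density = .014, URL = "https://www.geometry.org", max_no_of_attempts = 4};

# Hypothetical pattern for one "name = value" pair.
my $pair = qr/
    \G \s*                 # continue where the previous pair ended
    (\w+) \s* = \s*        # parameter name
    (?| "([^"]*)"          # double-quoted value
      | '([^']*)'          # single-quoted value
      | ([^,\s]+)          # unquoted value, up to the next comma or space
    )
    \s* (?: , | \z )       # separator or end of string
/x;

my %value_of;
while ($params =~ /$pair/g) {
    $value_of{$1} = $2;
}

print "$_ => $value_of{$_}\n" for sort keys %value_of;
```

Because the quotes are consumed as part of each pair, no lookbehind or lookahead is needed, and the quoted alternatives are tried before the unquoted one at every position.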
Re: regex is not working as I intended
by hippo (Chancellor) on Jan 18, 2018 at 23:23 UTC
    Why are the first two alternatives not capturing quoted test strings?

    Because your position is still at the string start so you can't use a lookbehind (or if you do it will never match). Simpler just to use this:

    use strict;
    use warnings;
    use Test::More tests => 3;

    my @strings = (
        'Now is the time',
        '"Now is the time"',
        "'Now is the time'",
    );

    my $regex = qr/^['"]?(.*?)['"]?$/;

    for my $input (@strings) {
        my ($match) = ($input =~ /$regex/);
        is $match, 'Now is the time', "$input matched";
    }

      my $regex = qr/^['"]?(.*?)['"]?$/;

      Note that this regex will also match oddballs like  q{'Now is the time} and  q{'Now is the time"}
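If the oddballs matter, one option is a backreference, so that the closing quote must be the same character as the opening one. This is only a sketch under the assumption of single-line input, and the helper name unquote is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: strips one pair of matching quotes, if present.
# \1 must repeat whatever the first group matched (a quote or nothing).
sub unquote {
    my ($input) = @_;
    my ($quote, $value) = $input =~ /^(['"]?)(.*?)\1$/;
    return $value;
}

for my $input (
    q{Now is the time},
    q{"Now is the time"},
    q{'Now is the time'},
    q{'Now is the time"},
) {
    printf "%-20s -> %s\n", $input, unquote($input);
}
```

With this pattern a mismatched string such as q{'Now is the time"} still matches, but only with an empty first group, so it comes back unchanged with its stray quotes still attached.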


      Give a man a fish:  <%-{-{-{-<

Re: regex is not working as I intended
by AnomalousMonk (Bishop) on Jan 18, 2018 at 22:50 UTC

    Not a complete answer, but just a note:  (?>...) is not a positive lookahead (which would be  (?=...) instead), but is an "atomic" grouping (update: also called an "independent" subexpression) that affects backtracking. See Extended Patterns in perlre.
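The practical difference shows up in what the overall match consumes. A minimal sketch, using a made-up test string:

```perl
use strict;
use warnings;

my $string = q{'abc'};

# (?=') only checks that a quote follows; it consumes nothing.
my ($with_lookahead) = $string =~ /((?<=').*?(?='))/;

# (?>') is an atomic group: it matches and consumes the quote.
my ($with_atomic) = $string =~ /((?<=').*?(?>'))/;

print "lookahead: $with_lookahead\n";   # lookahead: abc
print "atomic:    $with_atomic\n";      # atomic:    abc'
```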


    Give a man a fish:  <%-{-{-{-<

      Oh, yeah, that was a typo. Thanks for catching.
Re: regex is not working as I intended
by BillKSmith (Prior) on Jan 19, 2018 at 04:25 UTC
    It is much easier to test the three cases separately.
    use strict;
    use warnings;
    use Test::More tests=>3;
    use Regexp::Common 'RE_ALL';

    my @cases = (
        [q(Now is the time),   'unquoted'],
        [q('Now is the time'), 'single quoted'],
        [q("Now is the time"), 'double quoted'],
    );

    foreach my $case (@cases) {
        $_ = $case->[0];
        my $string = /\"(.*)\"/     ? $1
                   : /\'(.*)\'/     ? $1
                   : /[^"'].*[^'"]/ ? $&
                   :                  'No Match'
                   ;
        ok($string eq 'Now is the time', "$case->[1] found $string" );
    }

    C:\Users\Bill\forums\monks>perl fireblood.pl
    1..3
    ok 1 - unquoted found Now is the time
    ok 2 - single quoted found Now is the time
    ok 3 - double quoted found Now is the time
    Bill
Re: regex is not working as I intended
by Laurent_R (Canon) on Jan 18, 2018 at 22:56 UTC
    Unless I missed something, I don't see why you need to use look ahead and look behind assertions and the like here.

    It seems to me that this should work correctly (but can't test now on my mobile device):

    my $regex = qr / "([^"]+)" | '([^']+)' | (.+) /x;
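A quick way to try this regex is sketched below. The helper name extract_value is made up, and the defined-or operator (//, Perl 5.10 or later) is used to pick whichever of the three buffers took part in the match.

```perl
use strict;
use warnings;

my $regex = qr / "([^"]+)" | '([^']+)' | (.+) /x;

# Hypothetical helper: returns the value from whichever alternative matched.
sub extract_value {
    my ($input) = @_;
    return unless $input =~ $regex;
    return $1 // $2 // $3;
}

for my $input (q{Now is the time}, q{'Now is the time'}, q{"Now is the time"}) {
    print "$input => ", extract_value($input), "\n";
}
```

Because the quoted alternatives use + rather than *, an empty quoted value such as "" falls through to the last alternative and keeps its quotes.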
