http://www.perlmonks.org?node_id=1200743

NetWallah has asked for the wisdom of the Perl Monks concerning the following question:

Esteemed RegEx-Monkers:

In parsing log file lines that contain XML-ish content with a single regex, I'm having trouble understanding the subtleties of optional capture.

The string I'm parsing is like:

<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>
And I'm trying to extract the content of the "type", and the tag name of a tag that ends with "TagIwant".
The Tag may or may not be present.

I'm able to capture both pieces with the RE:

\btype="([^"]+)".+<(\w+TagIwant\b)
but - the match fails if I append a "?" to the expression, in an attempt to make it optional.
I.e. this fails:
perl -E '$x=q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>|; say for $x=~/\btype="([^"]+)".+<(\w+TagIwant\b)?/'
Which returns only "MyType", and not the second expected capture of "SomeTagIwant".

The "\b" is an attempt to deal with variations like <SomeTagIwant/> and <SomeTagIwant k3="v3" /> .

I'm hoping for (1) Explanations for why the "?" fails, and (2) Suggestions on how to fix it.

                All power corrupts, but we need electricity.

Re: Regex Optional capture doesn't
by haukex (Archbishop) on Oct 05, 2017 at 18:11 UTC

    I wouldn't call myself a full "regex expert" (I consider some other monks the real experts :-) ), so there might be a better way to do this, but I do know the trick to use a dot and negative lookahead to say "match anything, except this thing I don't want to match", similar to "[^"]+". This works:

    use warnings;
    use strict;
    use Test::More;

    my $re = qr{
        \b type = "([^"]+)"
        (?: (?!<\w+TagIwant\b) . )*
        <(\w+TagIwant\b)?
    }msx;

    is_deeply [ q{ <blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant> } =~ $re ],
        ['MyType','SomeTagIwant'];
    is_deeply [ q{ <blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant/> } =~ $re ],
        ['MyType','SomeTagIwant'];
    is_deeply [ q{ <blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant k3="v3" /> } =~ $re ],
        ['MyType','SomeTagIwant'];
    is_deeply [ q{ <blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/>} =~ $re ],
        ['MyType',undef];

    done_testing;

    Update: Possibly interesting related reading: [OT] Thoughts on Ruby's new absent operator?. Also a few minor edits for clarity.

Re: Regex Optional capture doesn't
by LanX (Saint) on Oct 05, 2017 at 16:34 UTC
    I think .+ is greedy and eats the rest of the string.

    But the regex is forced to backtrack if <(\w+TagIwant\b) is not optional; otherwise it concludes happily.

    Changing to .+? should solve this.

    UPDATE

    just tested, not my day. declaring myself officially (mentally) ill.

    UPDATE

    an elegant solution is buried at the end of this discussion here

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      It does not.

      Same result with ".+?".

                      All power corrupts, but we need electricity.

        Yep, just tested.

        But I think the explanation is still the same. (point 1)

        The fix doesn't work because the regex has to decide which non-greedy ? has "precedence".
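
To make the explanation above concrete, here is a minimal sketch that prints what each pattern actually consumes on the sample line from the question. The helper name probe() is an assumption made for illustration; it is not from the thread. With the greedy .+ the match ends at the last "<" in the line, and with the lazy .+? it ends at the first "<" after the type attribute. In both cases the optional group is allowed to match nothing, so the engine never has a reason to keep looking for the wanted tag.

```perl
use strict;
use warnings;

# Sample line from the question.
my $line = q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>|;

# probe() is a hypothetical helper: it returns the text the whole pattern
# consumed and the second capture, or an empty list if nothing matched.
sub probe {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    return ($&, $2);
}

my @patterns = (
    [ greedy => qr/\btype="([^"]+)".+<(\w+TagIwant\b)?/  ],
    [ lazy   => qr/\btype="([^"]+)".+?<(\w+TagIwant\b)?/ ],
);

for my $p (@patterns) {
    my ($name, $re) = @$p;
    my ($consumed, $tag) = probe($re, $line);
    printf "%-6s consumed: %s\n", $name, $consumed;
    printf "%-6s capture2: %s\n", $name, defined $tag ? $tag : 'undef';
}
```

Both runs should report the second capture as undefined; only the amount of text consumed differs.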

        I know there are ways to handle this in one regex, but my advice is just to use a second one checking the tail of the string. (point 2)
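
One possible single-regex approach, offered as a sketch and not necessarily the one the poster had in mind, is to make the whole "skip ahead, then capture the tag" part optional instead of only the capture group. Because the optional group is greedy, the engine first tries to find the tag and only gives up on the group if no such tag exists. The helper name extract() is an assumption for illustration.

```perl
use strict;
use warnings;

# The optional non-capturing group wraps both the lazy skip and the capture,
# so the engine must attempt to reach the tag before settling for nothing.
my $re = qr{ \b type = "([^"]+)" (?: .*? < (\w+TagIwant\b) )? }sx;

# extract() is a hypothetical helper: returns (type, tag); tag is undef
# when no tag ending in "TagIwant" follows the type attribute.
sub extract {
    my ($line) = @_;
    return $line =~ $re;
}

my @samples = (
    q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>|,
    q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant k3="v3" />|,
    q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/>|,
);

for my $line (@samples) {
    my ($type, $tag) = extract($line);
    printf "type=%s tag=%s\n", $type // 'undef', $tag // 'undef';
}
```

Note that this sketch only finds a tag that appears after the type attribute on the same line, which matches the sample data in the question.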

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

Re: Regex Optional capture doesn't
by Laurent_R (Canon) on Oct 05, 2017 at 17:47 UTC
    Hi NetWallah,

    Can't you use two separate regexes, one for the type and one for the "TagIwant"? This would be so much simpler.
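
A minimal sketch of the two-regex approach suggested above, assuming the patterns from the question are reused unchanged. The helper name parse_line() is hypothetical. Unlike the single-regex attempts, the second match here does not require the tag to appear after the type attribute.

```perl
use strict;
use warnings;

# parse_line() is a hypothetical helper: it runs two independent matches
# and returns (type, tag); either value is undef if its pattern fails.
sub parse_line {
    my ($line) = @_;
    my ($type) = $line =~ /\btype="([^"]+)"/;
    my ($tag)  = $line =~ /<(\w+TagIwant\b)/;
    return ($type, $tag);
}

my @lines = (
    q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>|,
    q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIDontWant><k1="v1"></SomeTagIDontWant>|,
);

for my $line (@lines) {
    my ($type, $tag) = parse_line($line);
    printf "type=%s tag=%s\n", $type // 'undef', $tag // 'undef';
}
```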

      I agree - BUT - I have invested a couple of hours in this, and now it has become an obsessive challenge.

      Looking for relief before checking myself into rehab.

      Thanks.

                      All power corrupts, but we need electricity.

        I saw your response soon after you posted it, but had to go to a meeting outside and did not have time. Coming back home, I tried a couple of things and it turns out to be more difficult than I thought.

        Although this solution is far from elegant, it seems to work:

          DB<1> $x = q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIwant><k1="v1"></SomeTagIwant>|;
          DB<2> $y = q|<blah1 phase="2" type="MyType" more_keys="Values" <Unwanted/> <SomeTagIDontWant><k1="v1"></SomeTagIDontWant>|;
          DB<3> print "$_ " for $x =~ /type="([^"]+)".+?(\w+?TagIwant) | type="([^"]+)"/x;
        MyType SomeTagIwant
          DB<4> print "$_ " for $y =~ /type="([^"]+)".+?(\w+?TagIwant) | type="([^"]+)"/x;
        MyType
          DB<5>
        It could be slightly improved with a named regex:
          DB<6> $type = qr/type="([^"]+)"/;
          DB<7> print "$_ " for $x =~ /$type .+?(\w+?TagIwant) | $type /x;
        MyType SomeTagIwant
          DB<8> print "$_ " for $y =~ /$type .+?(\w+?TagIwant) | $type /x;
        MyType
        Well, I think that haukex's solution below looks better.