http://www.perlmonks.org?node_id=699256


in reply to Re: Regex problems using '|'
in thread Regex problems using '|'

Okay, I posted some. What I don't understand is if I just use /Remediation Report\n\n(.+?)\n/, I get the line I'm looking for (Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE), but if I use /Remediation Report\n\n(.+?)\n|^(.+?)\n/, I get the top of the metadata (thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==).

Re^3: Regex problems using '|'
by moritz (Cardinal) on Jul 22, 2008 at 10:44 UTC
    The regex engine starts at the beginning of the string and tries to match the first alternative, here Remediation Report\n\n(.+?)\n. It doesn't match at that position, so it tries the second alternative, ^(.+?)\n. That one matches, so it captures the first line in $2 (the second pair of parentheses), and $1 stays undefined.

    Without the alternation, the regex engine moves its starting position until it finds the substring Remediation Report.

    (Actually it's much smarter than that; it searches for the constant substring with the same techniques that index uses, but from a user's point of view that only matters when it comes to speed, not in terms of functionality.)
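To make the difference concrete, here is a minimal sketch. The string in $str is a made-up stand-in for the real mail (which isn't shown in full in this thread), and the variable names are only for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the poster's data: two header lines, then the report.
my $str = "thread-index: abc\nMIME-Version: 1.0\n\n"
        . "Remediation Report\n\nAdobe Flash Player\n\nAdobe Reader\n";

# With the alternation: the attempt at offset 0 already succeeds via the
# second alternative, so the engine never advances to "Remediation Report".
# In list context the match returns ($1, $2).
my ($with_first, $with_second) =
    $str =~ /Remediation Report\n\n(.+?)\n|^(.+?)\n/;

# Without the alternation: the engine moves its starting position forward
# until the literal text matches.
my ($without) = $str =~ /Remediation Report\n\n(.+?)\n/;

print "with alternation:    $with_second\n";
print "without alternation: $without\n";
```

With this sample data the first line printed is the header line and the second is the report line, which mirrors what the original poster saw.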

      I'm not sure why the match is failing with the alternation when it matches correctly without it. I understand that if the first pattern does not match, it will go on to the second half. That's the behavior I want with regards to the rest of the records. Why is the match failing? That's what I do not understand, since it works correctly as long as the alternation is removed.
        If I understood your earlier reply correctly, the regex does match (with the alternation), but it doesn't match the way you want. That's a big difference, and what I tried to explain to you.
        "That's the behavior I want with regards to the rest of the records."

        After looking at the updated data I think that you need two regexes for that:

        use strict;
        use warnings;

        my $str = do { local $/; <DATA> };
        if ($str =~ m/Remediation Report\n\n(.+?)\n/g){
            print $1, $/;
            while ($str =~ m/\n\n(.*)\n/g){
                print $1, $/;
            }
        }
        __DATA__
        thread-index: AcjoCau17Ri90HMJR8qoukn2A1g7ng==
        MIME-Version: 1.0
        # rest of data goes here

        The output is:

        Adobe Flash Player Multiple Vulnerabilities - April 2008 - IE
        Adobe Flash Player Multiple Vulnerabilities - April 2008 - Mozilla/Opera
        Adobe Reader/Acrobat 8.1.2 and 7.1.0 Update - Acrobat 7.x

        The trick is to use the /g modifier on the first regex although it matches only once. In scalar context a successful /g match leaves pos $str set to the end of the match instead of leaving it undefined, so the next /g match starts where the previous one left off.
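Here is a small sketch of the effect of that /g. The sample data is invented (each item is followed by a "details" line so that a blank line separates the records), so treat it as an illustration, not as the real mail format.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real mail is not shown in full in the thread.
my $data = "header: x\n\nRemediation Report\n\nfirst item\ndetails\n\n"
         . "second item\ndetails\n";

# With /g on the first match, pos() is kept, so the loop resumes after it.
my $with_g = $data;
my @resumed;
if ($with_g =~ /Remediation Report\n\n(.+?)\n/g) {
    push @resumed, $1;
    while ($with_g =~ /\n\n(.*)\n/g) {
        push @resumed, $1;
    }
}

# Without /g on the first match, pos() is never set, so the loop
# starts searching at the beginning of the string.
my $without_g = $data;
my @restarted;
if ($without_g =~ /Remediation Report\n\n(.+?)\n/) {
    push @restarted, $1;
    while ($without_g =~ /\n\n(.*)\n/g) {
        push @restarted, $1;
    }
}

print "with /g:    @resumed\n";
print "without /g: @restarted\n";
```

With this data the first list holds only the two items, while the second one also picks up the "Remediation Report" heading, because the loop started over from the top of the string.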

        Also note that ^ will anchor to the start of the string (not to the start of a line) unless the /m modifier is present.
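A quick way to see the difference between the two anchoring modes, using a made-up three-line string:

```perl
use strict;
use warnings;

# Hypothetical three-line string, just for illustration.
my $text = "first line\nsecond line\nthird line\n";

# Without /m, ^ matches only at the very start of the string.
my @without_m = $text =~ /^(\w+)/g;

# With /m, ^ also matches right after each embedded newline.
my @with_m = $text =~ /^(\w+)/mg;

print "without /m: @without_m\n";
print "with /m:    @with_m\n";
```

Without /m only the first word of the string is found; with /m the first word of every line is found.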