ccn has asked for the wisdom of the Perl Monks concerning the following question:

I have read perlre, perlretut and perlop, but I have not found a rule for how \1, \2, \3 ... are interpreted inside a character class.

Experiment shows that \1 in a character class is treated as the octal escape \001 rather than as a backreference.

print "'\001' match\n" if "'\001'" =~ /(')[\1]\1/;
print "''' not match\n" unless "'''" =~ /(')[\1]\1/;
It prints both lines on my perl v5.8.3.

Is this behaviour documented anywhere?

Re: \1, \2, \3, ... inside of a character class
by mirod (Canon) on Aug 12, 2004 at 17:36 UTC

    I have come across that problem before: \1, \2... are not interpolated inside a character class (or at least not properly). You can use them _outside_ of the class though: (.)([\w]|\1), for example, should work fine.
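
    A minimal sketch of that suggestion, assuming the goal is "a word character or a repeat of the first character". The helper name second_is_word_or_repeat is made up for illustration, and the alternation is written as (?:\w|\1) so that it does not create an extra capture group:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the second character of $s is either a word
# character or a repeat of the first character.
sub second_is_word_or_repeat {
    my ($s) = @_;
    # \1 sits outside any [...] here, so it is a real backreference.
    return $s =~ /^(.)(?:\w|\1)/ ? 1 : 0;
}

print second_is_word_or_repeat("''"), "\n";   # quote followed by the same quote
print second_is_word_or_repeat("'a"), "\n";   # quote followed by a word character
print second_is_word_or_repeat("'-"), "\n";   # neither alternative applies
```

    This only covers the single-character case; it does not make the backreference behave like part of the class itself.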

Re: \1, \2, \3, ... inside of a character class
by dragonchild (Archbishop) on Aug 12, 2004 at 17:36 UTC
    I'm not going to answer the question as asked (smarter people will do that). However, I'm wondering what you're doing that requires this. It looks like you're trying to match mis-quoted items. Why aren't you using something like Text::Balanced or Regexp::Common?

    We are the carpenters and bricklayers of the Information Age.

    Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose

    I shouldn't have to say this, but any code, unless otherwise stated, is untested

      Personally I have no need to use backreferences in character classes. But I saw the following code:
      sub eraseCommet {
          my($all, $comment) = @_;
          return $all if !$comment;
      }
      s/(<(\/)?((!--)|(script)|(style)|\w+)(?(4).*?-->|(\s+\w+(?:\s*=\s*(["'])?(?(8)[^\8]+?\8|\S+?(?=[>\s])))?)*?\s*\/?>(?(5).*?<\/script>|(?(6).*?<\/style>))))/eraseCommet($1,$4)/gixse;
      That weird regex (I don't like it) was written by someone who wanted to strip comments from HTML. He knew about HTML::Parser, but he wanted to do it with regexps for fun. I was trying to find valid HTML on which that code fails, and I noticed that he used \8 as a backreference within a character class. I knew that one can use variables, as in [$var], but the use of [^\8] alarmed me. That is how I came across this probably undocumented behavior.
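
      What [^\8] was presumably meant to express ("any character other than the opening quote") can be sketched with a negative lookahead in front of each character. This is an illustration, not a fix for the regex above; the helper name quoted_value and the sample tags are made up:

```perl
use strict;
use warnings;

# Hypothetical helper: return the value of the first quoted attribute in $tag,
# or undef if there is none. The closing quote must be the same character as
# the opening one.
sub quoted_value {
    my ($tag) = @_;
    # (?:(?!\1).)* consumes characters one at a time, refusing any position
    # where the captured opening quote would match.
    return $tag =~ /=\s*(["'])((?:(?!\1).)*)\1/s ? $2 : undef;
}

print quoted_value(q{<a title="it's here">}), "\n";
print quoted_value(q{<a title='say "hi"'>}), "\n";
```

      Because the lookahead uses a real backreference, the other kind of quote is allowed inside the value, which a fixed class such as [^"'] would not permit.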
Re: \1, \2, \3, ... inside of a character class
by davido (Cardinal) on Aug 12, 2004 at 17:55 UTC

    Here is one way to build a character class based on a backreference. Note, I've had to use the (??{....}) construct, and I'm not positive (without diving again into the gory details of parsing) whether I'm relying on defined behavior or just happenstance. But it works!

    use strict;
    use warnings;

    while ( my $string = <DATA> ) {
        chomp $string;
        if ( $string =~ m/(\w+)\s((??{"[$1]+"}))/ ) {
            print "$string => matched: $1, $2!\n";
        }
        else {
            print "$string => Didn't match.\n";
        }
    }

    __DATA__
    abcde fgh
    abcde eadcabe

    With that snippet, the first line will fail to match, and the second line will succeed, because the second half of the second line contains only characters found in its first half. This could probably be accomplished with greater simplicity by just breaking it down into smaller regexps that cascade from one to the next, but I couldn't resist the challenge of doing it in one.

    Update: Having just re-read perlre, I'm satisfied that I'm relying on defined (though "experimental") behavior. The (??{...}) subexpression is a sort of postponed regular subexpression, and it should have full access to the $n ($1, $2, etc.) special variables for any parens that have actually matched so far. I could also have written the (??{...}) subexpression as

    (??{"[$^N]+"})

    ...because $^N is the same as the $1, $2, etc. variables but contains the most recent successful capturing subexpression.
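
    The simpler "cascade of smaller regexps" alternative mentioned above might look roughly like this. It is a sketch under assumptions: the helper name second_uses_first is made up, and unlike the one-regex version it anchors both matches, so it tests the whole second word rather than any leading run of it:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the second word of $string uses only
# characters that occur in the first word.
sub second_uses_first {
    my ($string) = @_;

    # Step 1: an ordinary match captures the two words.
    my ( $first, $second ) = $string =~ /^(\w+)\s(\w+)$/ or return 0;

    # Step 2: build the class from the captured text. quotemeta is a no-op
    # for \w characters, but would matter if the capture could hold ] ^ - or \.
    my $class = quotemeta $first;
    return $second =~ /^[$class]+$/ ? 1 : 0;
}

for my $string ( 'abcde fgh', 'abcde eadcabe' ) {
    print "$string => ", ( second_uses_first($string) ? 'matched' : "didn't match" ), "\n";
}
```

    The class is interpolated when the second regex is compiled, after the capture already exists, so no (??{...}) is needed.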


Re: \1, \2, \3, ... inside of a character class
by BrowserUk (Pope) on Aug 12, 2004 at 18:00 UTC

    I've never seen it documented, but what would it mean if they were interpolated as back references?

    By which I mean, if the back reference in question captured two or more characters, would that require the character class to match any one of those characters?

    I think that the problem would be that character classes are compiled before the regex begins to run, i.e. before the capture has taken place.

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: \1, \2, \3, ... inside of a character class
by gmpassos (Priest) on Aug 12, 2004 at 20:49 UTC
    As you said, you want \1 working as the first captured occurrence in a character class ([...]). But \1 can capture much more than a single character, so, it can't be used inside a character class because it doesn't represent a character, but a string. To match multiple strings just use (?:...|\1|...).

    Graciliano M. P.
    "Creativity is the expression of the liberty".

      \1 can capture much more than a single character, so, it can't be used inside a character class because it doesn't represent a character, but a string

      It's not a problem, see perlretut

      /[\]c]def/;   # matches ']def' or 'cdef'
      $x = 'bcr';
      /[$x]at/;     # matches 'bat', 'cat', or 'rat'
      /[\$x]at/;    # matches '$at' or 'xat'
      /[\\$x]at/;   # matches '\at', 'bat', 'cat', or 'rat'
        OK, what you want is \1 working as a set of characters. That can't be done directly, but you can use some more complex code:
        $var = "abXaaabbbc";
        $var =~ s/(..)X(??{"[$1]+"})/${1}#/;
        print "$var\n";

        Graciliano M. P.
        "Creativity is the expression of the liberty".