http://www.perlmonks.org?node_id=969212

fiverivers has asked for the wisdom of the Perl Monks concerning the following question:

Please help me find out why the regex below does not match the full table name.

I have a file with SQL queries. I am trying to extract table names from the file.

Sample of data in the file:

update users set timezone='Europe/London' where uid not in (0);
update field_data_field_location set field_location_value = NULL where field_location_value='select';
update field_revision_field_location set field_location_value = NULL where field_location_value='select';
update field_data_field_profession set field_profession_value = NULL where field_profession_value = 'select';
update field_revision_field_profession set field_profession_value = NULL where field_profession_value = 'select';

The following code does not match the last character of the table name

while (<FILE>) {
    $line = $_;
    if ($line =~ m/^update\ (\w*)[^\s]/) {
        print $1."\n";
    }
}

output

user
field_data_field_locatio
field_revision_field_locatio
field_data_field_professio
field_revision_field_professio

working regex

while (<FILE>) {
    $line = $_;
    if ($line =~ m/^update\ (\w*)\ [^\s]/) {
        print $1."\n";
    }
}

correct output

users
field_data_field_location
field_revision_field_location
field_data_field_profession
field_revision_field_profession

I cannot understand why the first regex does not work. Thanks for your help.

Replies are listed 'Best First'.
Re: why is regex not matching final character?
by moritz (Cardinal) on May 07, 2012 at 08:10 UTC

      That must be correct. I thought \w would gobble up the whole table name first and [^\s] would stop the gobbling at the first space. But it is matching as much as possible and excluding the last non-space character because it is not in the brackets as you said.

        [...] always matches exactly one character, so the previous \w* has to backtrack and give up one character, in order for the whole match to succeed.
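
A minimal sketch of that backtracking, using the first sample line from the question. The second capture group around the character class is added here purely for illustration (it is not in the original regex), and the variable names are made up for this example.

```perl
use strict;
use warnings;

# First sample line from the question's data.
my $line = "update users set timezone='Europe/London' where uid not in (0);";

# Same pattern as in the question, except that the character class is
# wrapped in a second capture group (an addition for illustration only)
# so we can see which character it ends up taking.
my ($name, $extra) = $line =~ m/^update\ (\w*)([^\s])/;

# \w* first grabs "users"; the next character is a space, which [^\s]
# cannot match, so \w* gives back the "s" and the class matches that.
printf "captured by \\w*:  <%s>\n", $name;    # <user>
printf "taken by [^\\s]:  <%s>\n", $extra;    # <s>
```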

        By the way you can write [^\s] simpler as \S.

        The problem here is \w and \S can match the same characters, so if you match a sequence of \w and \S, the rules about which matches what are governed by the backtracking rules of the regex engine, not by what your intuition expects.

        If you want to match a word, and then want to allow further non-whitespace characters (word characters or not), you can say (\w+)\S*. The \S* allows an empty sequence of non-whitespace characters too. The regex engine greedily matches as many characters as possible with the \w+, and happily leaves \S* to match the empty string if the following character is a space.
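
A small sketch of the (\w+)\S* form, assuming one line from the question plus one hypothetical line (not from the original data) in which the table name is followed directly by a non-word character. The array name is made up for this example.

```perl
use strict;
use warnings;

my @lines = (
    # From the question's sample data.
    "update users set timezone='Europe/London' where uid not in (0);",
    # Hypothetical input, added only to exercise the \S* part.
    "update users; -- hypothetical line",
);

my @names;
for my $line (@lines) {
    # \w+ keeps the whole word because \S* is allowed to match nothing,
    # so there is never a reason for \w+ to give a character back.
    if ($line =~ m/^update\ (\w+)\S*/) {
        push @names, $1;
    }
}

print "$_\n" for @names;    # prints "users" twice
```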

Re: why is regex not matching final character?
by AnomalousMonk (Archbishop) on May 07, 2012 at 11:31 UTC
    I thought \w would gobble up the whole table name first and [^\s] would stop the gobbling at the first space.

    And that's just what happens. In, e.g., 'users', the  (\w*) gobbles (and captures) 'user', and the  [^\s] gobbles (and swallows) 's'. But all that's just what moritz just said.

    Another way of looking at the regex (or any pre-Perl 5.7 regex) is with YAPE::Regex::Explain.

    >perl -wMstrict -le
    "use YAPE::Regex::Explain;
     ;;
     my $rx = qr/^update\ (\w*)[^\s]/;
     print YAPE::Regex::Explain->new($rx)->explain;
    "
    The regular expression:

    (?-imsx:^update\ (\w*)[^\s])

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      ^                        the beginning of the string
    ----------------------------------------------------------------------
      update                   'update'
    ----------------------------------------------------------------------
      \                        ' '
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        \w*                      word characters (a-z, A-Z, 0-9, _) (0 or
                                 more times (matching the most amount
                                 possible))
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
      [^\s]                    any character except: whitespace (\n, \r,
                               \t, \f, and " ")
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

Re: why is regex not matching final character?
by Anonymous Monk on May 07, 2012 at 16:11 UTC
    Regular expression quantifiers are greedy by default. Each one matches as many characters as it can while still allowing the rest of the pattern to match, not as few as possible.
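
To see the difference, here is a rough comparison of the greedy \w* from the question with its non-greedy counterpart \w*?. The non-greedy variant is shown only for contrast; it is not a fix for the original problem, and the variable names are made up for this example.

```perl
use strict;
use warnings;

# First sample line from the question's data.
my $line = "update users set timezone='Europe/London' where uid not in (0);";

# Greedy: \w* takes "users", then gives back one character so that the
# one-character class [^\s] can match.
my ($greedy) = $line =~ m/^update\ (\w*)[^\s]/;

# Non-greedy: \w*? starts with nothing, and [^\s] can immediately match
# the "u", so the capture stays empty.
my ($lazy) = $line =~ m/^update\ (\w*?)[^\s]/;

printf "greedy: <%s>\n", $greedy;    # <user>
printf "lazy:   <%s>\n", $lazy;      # <>
```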