Regex capturing either quoted strings or bare words

by gmax (Abbot)
on Jan 13, 2003 at 17:35 UTC (#226531=perlquestion)
gmax has asked for the wisdom of the Perl Monks concerning the following question:

I am writing a simple script language that needs to parse arguments passed as keyword=value pairs within the same string.
There are a few constraints, which account for some additional complexity:
  • Each pair can be either on a separate line or merged in a single line
  • Spaces are allowed before and after the equal (=) sign
  • Values can be either barewords or quoted strings. The quote character may be a single quote, a double quote or a backtick.
  • Values may contain spaces and the equal sign
  • Values may contain escaped quotes.
Example source strings are:
my @all_keywords = (qw(one two three ));
my @mandatory_keywords = (qw(two three ));

my $source1 = q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' };
my $source2 = <<'END';
two=a_34
three = 'name="O\'Hara"'
ONE="xyz\t"
END
The desired output, from both sources, is a hash containing
my %statement = (
    one   => 'xyz\t',
    two   => 'a_34',
    three => q{name="O\'Hara"},
);
In addition, I need to make sure that all the keywords are valid ones, and that the mandatory keywords are defined. Meeting all the requirements is not extremely difficult.
Please have a look at my test code. (The real code is a full-fledged module).
#!/usr/bin/perl -w
use strict;

my @all_keywords = (qw(one two three four five));
my @mandatory_keywords = (qw(two three four ));

my $RE_value = qr/
    (\w+)               # (1) a keyword
    \s* = \s*           # an equal sign with optional spaces
    (?:                 # quoted keyword ...
        (               #
            [\'\"\`]    # (2) a quoting character
        )
        (               # (3) the quoted value:
            (?:         # either
                \\\2    # an escaped quote
                |       # or
                [^\2]   # any non-quote character
            ) +?        # repeat (non-greedily)
        )
        \2              # until the initial quote shows up again
        |
        (\S+)           # (4) ... bare word value
    )
    /x;

sub set_value {
    my ($stat, $kw, $value) = (@_);
    # case insensitive keyword
    return 0 unless exists $stat->{lc $kw};
    $stat->{lc $kw} = $value;
    return 1
}

sub parse_pairs {
    my $src = shift;
    my %statement = map {$_, undef} @all_keywords;
    for ($src) {
        while ( ! m/ \G \s* \z /gcx ) {
            my $result = 0;
            if ( /\G \s* $RE_value \s* /xgc ) {
                $result = set_value( \%statement, $1, $4 ? $4 : $3 );
            }
            else {
                die "syntax error >" . substr($_, pos) ."\n";
            }
            die "invalid keyword $1 \n" unless $result;
        }
    }
    return \%statement;
}

sub check_pairs {
    my $statement = shift;
    for my $kw (@all_keywords) {
        if (defined $statement->{$kw}) {
            print "$kw \t -> <$statement->{$kw}>\n"
        }
        else {
            warn "- missing keyword <$kw>!\n"
                if grep {$kw eq $_} @mandatory_keywords;
        }
    }
}

my @sources = (
    q{ ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! },
    q{ five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` });

for (@sources) {
    print "\n>>Source: //$_//\n\n";
    my $stat = parse_pairs($_);
    check_pairs($stat);
}

__END__
output:

>>Source: // ONE="xyz\t" two=a_34 three = 'name="O\'Hara"' four=`'one' two` five = ah! //

one      -> <xyz\t>
two      -> <a_34>
three    -> <name="O\'Hara">
four     -> <'one' two>
five     -> <ah!>

>>Source: // five = ah! ONE="xyz\t" three = 'name="O\'Hara" two=a_34' four=`'one' two` //

one      -> <xyz\t>
- missing keyword <two>!
three    -> <name="O\'Hara" two=a_34>
four     -> <'one' two>
five     -> <ah!>
This regex correctly captures both the barewords and the quoted strings, taking care of embedded quotes and the escaped quote in the name.

(1) Could I have achieved the same result using any standard module?
(2) Also, does anyone spot any weakness where the paradigm may break?
So far, it is strong enough to handle correctly sources like
q{one="two=xyz" two=abc}
#      ^embedded keyword pattern
q{one="xyz two=abc three= efg"}
#         ^missing quotes^
In the first case, the embedded two=xyz is consumed by the engine as part of the quoted string, so it starts looking for a new match after the closing quote, thus rightly assigning "abc" to two and "two=xyz" to one.
The second case is an input mistake, and the error is found during the check at the end of the loop.
Also, about the preparation work, I had a look at Text::Balanced, which can deal with all the quotes, but it is not clear to me if and how it can also deal with barewords at the same time, and how it could fit in the engine.
 _  _ _  _  
(_|| | |(_|><

Replies are listed 'Best First'.
Re: Regex capturing either quoted strings or bare words (final backslash)
by tye (Sage) on Jan 13, 2003 at 18:53 UTC

    So how do you include a value that contains a space and ends with a backslash?

    You should change \\\2 to \\. and then decide which of three treatments you want:

    1. \x always becomes x
    2. \x stays \x except that \" becomes " and \\ becomes \
    3. \x stays \x except that \" becomes " and \\" becomes \" and \\\" becomes \\" etc.
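To make the difference between the first two treatments concrete, here is a minimal sketch. The helper names unescape_all and unescape_quote_only are hypothetical and are not part of the code posted above; both functions expect the captured value with the surrounding quotes already removed.

```perl
use strict;
use warnings;

# Treatment 1: a backslash followed by any character yields that character.
sub unescape_all {
    my ($s) = @_;
    $s =~ s/\\(.)/$1/gs;
    return $s;
}

# Treatment 2: only backslash-quote and backslash-backslash are unescaped;
# every other pair (such as \t) is left untouched.
sub unescape_quote_only {
    my ($s, $quote) = @_;
    my $q = quotemeta $quote;
    $s =~ s/\\(\\|$q)/$1/g;
    return $s;
}

my $raw = q{name="O\'Hara"\t};
print unescape_all($raw), "\n";
print unescape_quote_only($raw, q{'}), "\n";
```

Under treatment 1 the \t in the sample loses its backslash; under treatment 2 it survives and only \' is reduced to a quote.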

    But I find a much better method is to not use \ for escaping embedded quote characters if that is the only character you want to escape. Instead, use two adjacent quote characters to represent one embedded quote character.

    That is, change \\\2 to \2\2 and then post-process the match to undouble the embedded quote characters.

    One problem with this approach is if you end up nesting lots of these constructs you'll end up with:

        q{one="two=""three=""""a b""""""" two=abc}

    but that isn't much worse than the alternative of

        q{one="two=\"three=\\\"a b\\\"\"" two=abc}

    and allowing multiple quote characters (like you have) is the real solution to such problems

        q{one="two='three=`a b`'" two=abc}

    and avoiding a single escape character is why I prefer my approach.
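The doubled-quote idea can be sketched as follows. This is only an illustration under stated assumptions: $RE_pair and parse_doubled are hypothetical names, the keyword validation from the original script is left out, and the possessive quantifier (*+) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

my $RE_pair = qr/
    (\w+) \s* = \s*                   # (1) keyword and equal sign
    (?:
        (["'`])                       # (2) opening quote
        ( (?: \2\2 | (?!\2) . )*+ )   # (3) a doubled quote, or any other character
        \2                            # closing quote
      |
        (\S+)                         # (4) bare word
    )
/xs;

sub parse_doubled {
    my ($src) = @_;
    my %pairs;
    while ( $src =~ /\G \s* $RE_pair /gcx ) {
        my ($kw, $quote, $quoted, $bare) = ($1, $2, $3, $4);
        my $value = $bare;
        if (defined $quote) {
            # post-process: turn each doubled quote back into a single one
            ($value = $quoted) =~ s/\Q$quote$quote\E/$quote/g;
        }
        $pairs{lc $kw} = $value;
    }
    # anything left over that is not whitespace is a syntax error
    $src =~ /\G \s* \z/gcx
        or die "syntax error near: " . substr($src, pos($src) || 0) . "\n";
    return \%pairs;
}

my $pairs = parse_doubled(q{one="say ""hi""" two=abc});
print "$_ => $pairs->{$_}\n" for sort keys %$pairs;
```

The possessive quantifier keeps the engine from backtracking into a doubled quote and treating its first half as the closing quote.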

    Update: I wouldn't use a non-greedy match. I'd also be more strict so the regex engine doesn't have any option about matching things other than the way I want it to. So in your original code [^\2] should be [^\\\2] (though I recall [^\2] not working when I tested it so perhaps this means that your code won't work on older versions of Perl).

    You don't want the regex engine to decide to look at 'I\'m' and match \ against [^\2] and then have the middle ' terminate the string too early. Right now this probably won't happen due to subtle rules (I assume, based on your testing -- the rules are subtle enough that I'd have guessed that the regex would go the other route) but this leeway means that the regex can backtrack when a closing quote is missing and match a different quote in the manner I describe. You don't want to allow this.

    You should also allow empty strings (so change +? to just *). And I'd use [^\2]+ in hopes of being more efficient, but such concerns should be considered last.
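A stricter pattern along these lines might look like the sketch below. It is an assumption-laden illustration, not the code from the original post: $RE_strict and parse_strict are hypothetical names, and a negative lookahead (?!\2) stands in for the [^\\\2] class, because \2 inside a bracketed character class is read as the character chr(2) rather than as a backreference.

```perl
use strict;
use warnings;

# Inside a bracketed character class, \2 is the character chr(2), not a
# backreference, so [^\2] accepts the quote character as well.
my $class_accepts_quote = ( q{""} =~ /^(")[^\2]$/ ) ? 1 : 0;

my $RE_strict = qr/
    (\w+) \s* = \s*                     # (1) keyword and equal sign
    (?:
        (["'`])                         # (2) opening quote
        ( (?: \\. | (?!\2) [^\\] )* )   # (3) escaped pair, or a char that is neither quote nor backslash
        \2                              # closing quote
      |
        (\S+)                           # (4) bare word
    )
/xs;

sub parse_strict {
    my ($src) = @_;
    my @pairs;
    while ( $src =~ /\G \s* $RE_strict /gcx ) {
        # a defined $2 means the quoted branch matched, even for ""
        push @pairs, [ lc $1, defined $2 ? $3 : $4 ];
    }
    return \@pairs;
}

for my $pair ( @{ parse_strict(q{one="a b\\\\" two='I\'m' three=""}) } ) {
    print "$pair->[0] => <$pair->[1]>\n";
}
```

Because a backslash can only be consumed together with the character after it, an unterminated 'I\'m cannot be closed early at the escaped quote; the quoted branch fails as a whole and the text falls through to the bareword branch. Testing defined $2 rather than the truth of $4 also lets empty strings and a bare 0 through.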

    Update2: I notice you use \t in your values but I don't see you dealing with that anywhere. Is that supposed to stay \t or become a tab? Or is that just to test that other backslashes don't get eaten? For that matter, I don't see where you turn \' into ' so...

    And no need to backslash the quotes in a character class so you can use ["'`] instead (though it doesn't hurt either).

    You might want to look at Regexp::Common to compare how it does some of these things. Unfortunately, reading the code of that module is rather difficult. Luckily, you can just print out the regexes it gives back to you instead. (:

                    - tye
      Many thanks for your analysis. You have pinpointed many of the risks that I overlooked in my tests.
      I didn't think about a string ending with a backslash. I will follow your suggestion about doubling the quotes. It is unlikely to happen that much, though. Having three different quotes available, the users should be able to pick the most appropriate symbol to avoid clashes.
      "\t" was just a test; it wasn't supposed to be expanded (at least in this script).
       _  _ _  _  
      (_|| | |(_|><
Re: Regex capturing either quoted strings or bare words
by BrowserUk (Pope) on Jan 13, 2003 at 18:22 UTC

    Sounds similar to, but not the same as, the sort of thing that the Config::Simple and Config::General modules do.

    Not that I have any great experience with them, but they might be worth your while looking at if you haven't already.

    Examine what is said, not who speaks.

    The 7th Rule of perl club is -- pearl clubs are easily damaged. Use a diamond club instead.

Re: Regex capturing either quoted strings or bare words
by ihb (Deacon) on Jan 13, 2003 at 17:51 UTC
    This seems like a typical job for Parse::RecDescent. There are tutorials to be found through google.

Re: Regex capturing either quoted strings or bare words
by IlyaM (Parson) on Jan 14, 2003 at 11:15 UTC
    Also, about the preparation work, I had a look at Text::Balanced, which can deal with all the quotes, but it is not clear to me if and how it can also deal with barewords at the same time, and how it could fit in the engine.

    Take a look at the source of the _parse_param() subroutine in HTTP::WebTest::Parser, which parses such name/value pairs where values are optionally quoted. It uses Text::Balanced to extract quoted values.
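As a rough idea of how Text::Balanced could be combined with bareword handling, here is a minimal sketch. It is not the code of HTTP::WebTest::Parser; parse_with_balanced is a hypothetical name, and the sketch assumes that extract_delimited, called in list context with an empty prefix pattern, returns the quoted string (delimiters included) followed by the remainder, or an undefined first element when the text does not start with a quoted string.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_delimited);

sub parse_with_balanced {
    my ($text) = @_;
    my %pairs;
    while ( $text =~ /\S/ ) {
        # strip the keyword and the equal sign from the front
        $text =~ s/\A \s* (\w+) \s* = \s*//x
            or die "syntax error near: $text\n";
        my $kw = lc $1;

        # try a quoted value first; any of the three quote characters may open it
        my ($quoted, $rest) = extract_delimited($text, q{"'`}, '');
        if (defined $quoted && length $quoted) {
            $pairs{$kw} = substr($quoted, 1, -1);   # drop the delimiters
            $text = $rest;
        }
        elsif ( $text =~ s/\A (\S+)//x ) {
            $pairs{$kw} = $1;                       # bare word
        }
        else {
            die "missing value for $kw\n";
        }
    }
    return \%pairs;
}

my $pairs = parse_with_balanced(q{ONE="xyz\t" two=a_34 three = 'name="O\'Hara"'});
print "$_ => <$pairs->{$_}>\n" for sort keys %$pairs;
```

The bareword case is handled outside Text::Balanced with an ordinary substitution, which is one way the two pieces could fit in the same loop.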

    Ilya Martynov,
    CTO IPonWEB (UK) Ltd