
Parsing arguments

by hv (Parson)
on Feb 20, 2003 at 17:14 UTC (#237137)
hv has asked for the wisdom of the Perl Monks concerning the following question:

I'm writing code to parse a simplish string that has components like:

    bareword
    "quoted string"
    'quoted string'
    bareword = bareword
    bareword = "quoted string"
    bareword = 'quoted string'

and can also have a separating ':' somewhere in there.

My code for this currently looks like this:

    while ($tag =~ m{
        \G
        (?:
            ( \w+ )
            (?:
                \s* = \s*
                (?:
                    ( \w+ )
                |
                    ' ([^']*) '
                |
                    " ([^"]*) "
                )
            )?
        |
            ' ([^']*) '
        |
            " ([^"]*) "
        |
            ( : )
        )
        (?= \s | \z) \s*
    }gcxs) {
        push @args,
              defined($5) ? $5            # 'quoted string'
            : defined($6) ? $6            # "quoted string"
            : defined($7) ? $7            # :
            : defined($2) ? [ $1, $2 ]    # bareword=bareword
            : defined($3) ? [ $1, $3 ]    # bareword='quoted string'
            : defined($4) ? [ $1, $4 ]    # bareword="quoted string"
            : $1                          # bareword
            ;
    }

but that feels like an ugly way to do things - there's duplication of chunks of the pattern, and all those assumptions about the capture numbering. Surely there must be a better way to do this?


Update per author - dvergin 2003-02-21

Replies are listed 'Best First'.
Re: Parsing arguments
by tadman (Prior) on Feb 20, 2003 at 17:26 UTC
    This is usually a lot trickier than the example code you have here for reasons such as:

        bareword='quoted\'s string'
        bareword='quoted''s string'
        bareword="quoted \"string\""

    I'm not sure if that's something you're going to have to deal with, but it's always good to have a complete test case.

      Agreed, though there's no current requirement to support escaping in the quoted strings. But that's exactly the sort of reason I'm unhappy about the duplicated regexp chunks, since it'd be easy to enhance one of the ' ([^']*) ' fragments and miss the other.

      Similarly, I could collapse the single- and double-quoted string checks with a style like   (["']) (.*?) \1, but that can make tracking the capture numbers even harder, since now a change in the regexp would mean updating the numbering elsewhere in the pattern as well as in the push @args, ... code.
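
Both worries (duplicated fragments and fragile capture numbers) can be reduced with named captures. The following is only a sketch, and it assumes Perl 5.10 or later, which postdates this thread, for `(?<name>...)` and the `%+` hash. The names `parse_tag` and `$value` are invented here for illustration. The value fragment is written once and interpolated, and when several groups share a name, `%+` returns the leftmost one that is defined.

```perl
use strict;
use warnings;

# A value is written once: bareword, 'single quoted' or "double quoted".
# All three alternatives capture into the same name (needs Perl 5.10+).
my $value = qr{ (?<val> \w+ ) | ' (?<val> [^']* ) ' | " (?<val> [^"]* ) " }x;

# Hypothetical helper: returns a list of plain values and [key, value] pairs.
sub parse_tag {
    my ($tag) = @_;
    my @args;
    pos($tag) = 0;
    while ($tag =~ m{
        \G
        (?:
            (?: (?<key> \w+ ) \s* = \s* )? (?: $value )
        |
            (?<val> : )
        )
        (?= \s | \z ) \s*
    }gcx) {
        push @args, defined $+{key} ? [ $+{key}, $+{val} ] : $+{val};
    }
    # /c keeps pos() at the end of the last successful match
    die "parse error at offset " . pos($tag) . "\n"
        if pos($tag) != length $tag;
    return @args;
}

for my $arg (parse_tag(q{foo "a b" : k=v k2 = 'x y'})) {
    print ref $arg ? "pair: $arg->[0] => $arg->[1]\n" : "value: $arg\n";
}
```

Adding escape support would then mean editing `$value` in one place, and the `push` no longer depends on group positions.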

Re: Parsing arguments
by xmath (Hermit) on Feb 20, 2003 at 18:52 UTC
    Here you go.. not much duplication, supports escaping quotes, and properly fails if there's garbage in the input (you can replace the die with more elegant error handling)

        # skip initial whitespace, if any
        $tag =~ /^\s*/gcxs;

        # look for optional key, and a bareword or the start of a quoted string
        while ($tag =~ / \G (?: (\w+) \s* = \s*)? ( ['"](?=.) | \w+ ) /gcxs) {
            my ($k, $v) = ($1, $2);

            # extract quoted string
            if ($v eq "'" || $v eq '"') {
                $tag =~ / \G ( (?: \\. | [^$v] )* ) $v /gcxs or last;
                $v = $1;
                $v =~ s/\\(.)/$1/g;    # unescape characters
            }

            # skip optional separator
            $tag =~ / \G \s* :? \s* /gcxs;

            # save value
            push @args, $k ? [$k, $v] : $v;
        }

        # check if parsing was successful
        die if pos $tag != length $tag;

    •Update: If you like, you can of course precompile the quote patterns so you don't have any variable patterns, like:

        my %quotes;
        $quotes{$_} = qr/ \G ( (?: \\. | [^$_] )* ) $_ /xs for qw(' ");

    and then replace the if-block with:

        if (my $pat = $quotes{$v}) {
            $tag =~ /$pat/gc or last;
            $v = $1;
            $v =~ s/\\(.)/$1/g;
        }

    But I don't know if that results in any significant increase in speed.

      Thanks, that's an interesting approach. I find the logic still rather complex, though, and I think if I were going in this direction I'd separate it out a bit differently:

          while (pos($tag) < length($tag)) {
              if ($tag =~ m{ \G (\w+) \s* = \s* }gcx) {
                  push @args, [ $1 ];    # key; its value is attached below
              } elsif ($tag =~ m{ \G (\w+) (?= \s* | \z ) \s* }gcx) {
                  push @args, $1;
              } elsif ($tag =~ m{ \G (['"]) ( (?: \\. | [^\\] )*? ) \1 \s* }gcx) {
                  (my $quoted = $2) =~ s/\\(.)/$1/g;
                  push @args, $quoted;
              } else {
                  die "parsing error\n";
              }
          }
          # attach to each key the value that follows it
          for (my $i = 0; $i < @args; ++$i) {
              $args[$i][1] = splice(@args, $i + 1, 1) if ref $args[$i];
          }

        There are many variations possible. I have to admit yours is simpler, although I should note it parses a different language than your original request ("foo=foo=foo=foo=" is considered valid in this version).

        BTW, benchmarks have shown using .*?D is slower than [^D]*D (where D is the delimiter).
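
The claim above can be checked locally with the core Benchmark module. This is a rough sketch only: the sample string and the sub names `lazy_dot` and `negated_set` are made up for illustration, and the ratio will depend on the Perl version and on the input, so no particular result is implied.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample input: one long quoted string followed by more text.
my $text = q{"} . ('x' x 500) . q{" and some trailing words};

# Lazy dot-star up to the delimiter.
sub lazy_dot    { return $text =~ / \A " (.*?)   " /xs ? $1 : undef }

# Negated character class up to the delimiter.
sub negated_set { return $text =~ / \A " ([^"]*) " /xs ? $1 : undef }

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'lazy .*?' => \&lazy_dot,
    'negated'  => \&negated_set,
});
```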

        Also note using [^\\] is not necessary, though harmless (since if the char is a backslash, the \\. will match unless the backslash is at the end, in which case there's also no delimiter and the whole pattern will not match).

        And finally, in your original you used the /s while you're not using it here.. I don't know if that's deliberate or a mistake, but I thought I'd note it.

        •Update: and I just noticed you completely forgot support for the colon-delimiter (although that's not hard to add). Also that (?= \s* | \z ) zero-width assertion is completely futile since it also matches 0 chars (due to the \s*).
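
The point about the zero-width assertion can be seen in a small sketch. The helper `leading_bareword` is a name invented for this illustration, and `\A` stands in for `\G` so that each pattern can be tried on its own. Because `\s*` can match nothing, the first lookahead always succeeds, while the form from the original question rejects a bareword that runs straight into a quote.

```perl
use strict;
use warnings;

my $loose  = qr{ \A (\w+) (?= \s* | \z ) }x;   # \s* may match zero characters
my $strict = qr{ \A (\w+) (?= \s  | \z ) }x;   # needs whitespace or end of string

# Hypothetical helper: return the leading bareword, or undef if none matches.
sub leading_bareword {
    my ($re, $input) = @_;
    return $input =~ $re ? $1 : undef;
}

for my $input ("foo bar", "foo'bar'") {
    for my $pair ([ loose => $loose ], [ strict => $strict ]) {
        my ($name, $re) = @$pair;
        my $got = leading_bareword($re, $input);
        printf "%-6s %-10s %s\n", $name, $input,
            defined $got ? "matched '$got'" : "no match";
    }
}
```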

Re: Parsing arguments
by Thelonius (Priest) on Feb 20, 2003 at 18:48 UTC
    Unless you really feel the need to write your own patterns, you should check out Text::ParseWords. The quotewords() and shellwords() functions will make life much easier for you.

      Thanks, I'll have to take a closer look at that. At first glance it seems I'll still have to do some parsing on the resulting token stream, and I'm not sure how much cleaner the resulting code would be. In particular, I'd have to keep the delimiters to be able to distinguish between key=value and "not=a key", and then reparse them out again.
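
To make that trade-off concrete, here is a minimal sketch using quotewords() from Text::ParseWords with its `$keep` argument off and on. The sample string is invented for illustration, and it has no spaces around the `=`, since whitespace there would split the key, the `=` and the value into separate tokens.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical sample input
my $tag = q{key="a b" "not=a key" plain};

# $keep = 0: quotes are stripped, so key=value and a quoted
# string that merely contains '=' can no longer be told apart.
my @stripped = quotewords('\s+', 0, $tag);

# $keep = 1: quotes survive, but each token then needs a second
# pass to split off the key and remove the quotes again.
my @kept = quotewords('\s+', 1, $tag);

print join(' | ', @stripped), "\n";
print join(' | ', @kept), "\n";
```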

Re: Parsing arguments
by Pardus (Pilgrim) on Feb 20, 2003 at 20:15 UTC
    I'm afraid I can only offer the same reply I gave castaway earlier today - see node 237066
    Jaap Karssenberg || Pardus (Larus)?
    >>>> Zoidberg: So many memories, so many strange fluids gushing out of patients' bodies.... <<<<
