http://www.perlmonks.org?node_id=574763

pbeckingham has asked for the wisdom of the Perl Monks concerning the following question:

Can someone help? I have given myself the challenge of doing some simple parsing, but in a complex way. Without focusing on why I choose to do this, can someone guide me towards a viable solution? Given the following input:

name1=value1
name2 = value2
This code parses it:
while (<$input>) {
    chomp;
    next if /^ \s* \# /x;    # skip comment lines
    next if /^ \s* $/x;      # skip blank lines
    if (/^ \s* ([^=\s]+) \s* = \s* (.+) $/x) {
        # name is in $1, value is in $2
    }
}
That's not the question though. The question is, how would I parse the following:
name1=value1
name2 = value2
name3 = value3
  but wait, there is more
name4= value4
With Perl that has the form:
my $contents = do {local $/; <$input>};
while ($contents =~ / ANSWER_HERE /msg) {
    # name is in $1, value is in $2
}
Specifically, I want to use the //g form, to iterate over the string, and not perform a line-by-line parse, as in the first example. My attempts have thus far failed. The closest I got (without success) was:
my $contents = do {local $/; <$input>};
my $name = qr/\s* [^=\s]+ \s*/x;
while ($contents =~ /^ ($name) = \s* (.+) (?= ^ $name = | $ ) /msgx) {
    # name is in $1, value is in $2
}
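One likely reason the attempt above falls short, offered as a reading of the regex rather than anything stated in the post: under /s the greedy (.+) can run to the end of the string, where the `$` branch of the lookahead is satisfied at once. The sketch below uses a hard-coded sample string (an assumption standing in for the real file) and counts the matches.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the file contents.
my $contents = "name1=value1\nname2 = value2\nname3 = value3\n"
             . "  but wait, there is more\nname4= value4\n";

my $name = qr/\s* [^=\s]+ \s*/x;

my @matches;
while ($contents =~ /^ ($name) = \s* (.+) (?= ^ $name = | $ ) /msgx) {
    push @matches, [$1, $2];
}

# Report how many name/value pairs were found and what the first value holds.
printf "%d match(es)\n", scalar @matches;
print "first value: [$matches[0][1]]\n" if @matches;
```

If this reading is right, the loop body runs only once and the first "value" swallows every later line, which is why the replies turn to restricting or lazily extending the value.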



pbeckingham - typist, perishable vertebrate.

Re: Parsing using m//g
by ikegami (Patriarch) on Sep 25, 2006 at 16:00 UTC
    my $contents = do { local $/; <DATA> };
    while ($contents =~ /
        \s* ([^=\s]+) \s* = \s*
        (
            (?:
                (?! \s* (?: [^=\s]+ \s* = | $ ) )
                .
            )*
        )
    /xmsg) {
        print("[$1 => $2]\n");
    }
    __DATA__
    name1=value1
    name2 = value2
    name3 = value3
      but wait, there is more
    name4= value4

    Outputs

    [name1 => value1]
    [name2 => value2]
    [name3 => value3]
    [name4 => value4]

    Update: The above works by never allowing bad data in the value. The following is an alternate solution that works by starting with an empty value, and extending it as much as possible.

    my $contents = do { local $/; <DATA> };
    while ($contents =~ /
        \s* ([^=\s]+) \s* = \s*
        (.*?)    # Extend the value.
        (?= \s* (?: [^=\s]+ \s* = | $ ) )
    /xmsg) {
        print("[$1 => $2]\n");
    }
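    The two styles described in the update can be compared on a much smaller case. This is only an illustrative sketch: the sample string and the semicolon used as the stop pattern are assumptions, not part of the original problem.

```perl
use strict;
use warnings;

# Hypothetical input: the value ends where the stop pattern (a semicolon) begins.
my $text = "key = some value;next";

# Style 1: consume one character at a time, but only while the
# stop pattern does not start at the current position.
my ($tempered) = $text =~ / = \s* ( (?: (?! ; ) . )* ) /xs;

# Style 2: start with an empty value and grow it lazily until the
# lookahead sees the stop pattern.
my ($lazy) = $text =~ / = \s* ( .*? ) (?= ; ) /xs;

print "tempered: [$tempered]\n";
print "lazy:     [$lazy]\n";
```

    Both captures come out as "some value" here. The lazy form is usually shorter to read, while the first form keeps the restriction attached to each character it consumes.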

      To be correct, your output would have to be:

      [name1 => value1]
      [name2 => value2]
      [name3 => value3 but wait, there is more]
      [name4 => value4]



      pbeckingham - typist, perishable vertebrate.

        Simply change /.../xmsg to /.../xsg.
        and/or
        Simply change $ to \z.

        Update: Added the second (and better) option.
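        For reference, a sketch of the lazy variant with both suggestions applied (`\z` in place of `$`, and /m dropped). The heredoc stands in for the DATA section, and collapsing the embedded newline before printing is an added assumption so that the continuation line appears on one line.

```perl
use strict;
use warnings;

# Sample input from the question, including the continuation line.
my $contents = <<'END';
name1=value1
name2 = value2
name3 = value3
  but wait, there is more
name4= value4
END

my @pairs;
while ($contents =~ /
        \s* ([^=\s]+) \s* = \s*              # capture the name
        (.*?)                                # capture the value, extended lazily
        (?= \s* (?: [^=\s]+ \s* = | \z ) )   # stop before the next name or at end of string
    /xsg) {
    # Copy the captures first: the substitution below resets them.
    my ($name, $value) = ($1, $2);
    $value =~ s/\s+/ /g;    # assumption: fold continuation lines into one line
    push @pairs, [$name, $value];
    print "[$name => $value]\n";
}
```

        Because `\z` matches only at the very end of the string, the value for name3 is no longer cut off at the first line break, and the third record prints as [name3 => value3 but wait, there is more].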