Any time you think to yourself "thats too complicated for a regex, and its too complicated for
Text::Balanced so I guess I need to write a parser" you should reach for the handy dandy
TheDamian approved
Parse::RecDescent.
Heres your problem solved using that module. And with an extra thrown in, the grammer supports a {{ hash_key=>({"value"}) }} syntax as well just to show how easy it is. (If you dont like the hash functionality you can remove it easily. :-)
Incidentally it seems like you are using perl like syntax so that ({,,,,1,,,,}) parses in the same way that [,,,,1,,,,] would parse in perl. I took this a step further and gave it the same behaviour as perl, at least as far as its handling of barewords and the phat comma => go. This slightly complicated matters but not too much.
Anyway, what you haven't really said (or I didnt notice :-) is what happens if you try to parse "". This code will fail (return undef) in that case.
UPDATE: Sigh. Like a muppet I didnt pay attention to the fact that you _have_ provided a hash notation specification. Well, heres a version that handles that. Ive kept my original perl version though. Sorry about that. Anyway, heres a version that parsers AFAICT what you want. Notice it isnt all that different from the other version. (Ive readmore'd my original stuff)
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;
$::RD_WARN = 0 || undef; # unless undefined, also report non-fa
+tal problems
$::RD_HINT = 0 || undef; # if defined, also suggestion remedies
$::RD_TRACE = 0 || undef; # if defined trace the parse
$::RD_AUTOACTION ='$item[1]';
my $parser = Parse::RecDescent->new(<<'ENDGRAMMAR');
# BEGIN GRAMMAR
# we want to parse things like...
# ({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
# Base Rule (start-rule)
expr : value /\z/
# Recursion point
value : array
| hash
| string
| number
# list of items. returns an arrayref
# we grep out the undefs so ",,1,," is treated as one element and not
# three or five or any other number :-)
array : "({" <leftop: val_or_empty ',' val_or_empty > "})"
{
[ grep !UNIVERSAL::isa($_,"Value::Empty"),@{$item[2]} ]
}
hash : "([" <leftop: keyvalue ',' keyvalue > "])"
{
$return={};
!UNIVERSAL::isa($_,"Value::Empty") and
($return->{$_->[0]}=$_->[1]) foreach @{$item[2]};
}
keyvalue : string_or_number ":" value
{
[ @item[1,3] ]
}
| empty
string_or_number : string
| number
val_or_empty : value
| empty
empty : ""
{
bless \do{my $x},"Value::Empty"
}
# quoted escaped string. escaping reversed and quotes removed.
string : /"((?:[^"\\]+|\\"|\\)+)"/
{
my $ret=$1;
$ret=~s/\\"/"/g;
$ret;
}
# number. could be a better regex
number : /\d+/
ENDGRAMMAR
my $tests=<<'ENDTEST';
({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
1
"This is \"quoted\" dude"
({})
({({}),({})})
({"wow",({1}),"bob",({1}),"cool"})
(["cool":1,"subhash":(["b":"c",1:2]),"a":({1,2,3,([]),}),])
ENDTEST
foreach my $test (split /\n/,$tests) {
print "------------------\n\n";
my $value=$parser->expr($test);
print "'$test'\n";
if ( defined $value ) {
print "\nproduced the following structure:\n\n";
print Data::Dumper->new([$value])
->Terse(1)
->Indent(1)
->Dump(),"\n";
} else {
print "Failed to parse!\n";
}
}
Heres the version that parses much more like perl. Its what I produced not having noticed the hash notation spec you provided. Sigh.
END UPDATE
use strict;
use warnings;
use Parse::RecDescent;
use Data::Dumper;
$::RD_WARN = 0 || undef; # if defined, report non-fatal probs
$::RD_HINT = 0 || undef; # if defined, give suggestion remedies
$::RD_TRACE = 0 || undef; # if defined trace the parse
$::RD_AUTOACTION ='$item[1]';
my $parser = Parse::RecDescent->new(<<'ENDGRAMMAR');
# BEGIN GRAMMAR
# we want to parse things like...
# ({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
# Base Rule (start-rule)
expr : value /\z/
# Recursion point
value : array
| hash
| string
| number
| bareword
# list of items. returns an arrayref
# we grep out the undefs so ",,1,," is treated as one element and not
# three or five or any other number :-)
list : <leftop: val_or_empty comma val_or_empty >
{
my @items=grep !UNIVERSAL::isa($_,"Value::Empty"),@{$item[1]}
+;
$return=[];
while (@items) {
if (UNIVERSAL::isa($items[0],"Value::Bareword")) {
if (!UNIVERSAL::isa($items[1],"List::Seperator")
or ${$items[1]} ne "=>") {
warn "Dont know what to do with bareword '$items[
+1]'\n";
undef $return;
last;
} else {
$items[0]=${$items[0]};
}
}
push @$return,shift @items;
shift @items while UNIVERSAL::isa($items[0],"List::Sepera
+tor");
}
$return
}
array : "({" list "})"
{
$item[2]
}
hash : "{{" list "}}"
{
my $items=$item[2];
if (@$items % 2) {
push @$items,undef;
warn "Uneven list in hash creation\n";
}
$return = { @$items };
}
val_or_empty : value
| ""
{
bless \do{my $x},"Value::Empty"
}
# quoted escaped string. escaping reversed and quotes removed.
string : /"((?:[^"\\]+|\\"|\\)+)"/
{
my $ret=$1;
$ret=~s/\\"/"/g;
$ret;
}
# number. could be a better regex
number : /\d+/
# a bareword. well figurout if its legal later
bareword : /\w+/
{
bless \do{ my $x=$item[1] },"Value::Bareword"
}
comma : /,|=>/
{
bless \do{ my $x=$item[1] },"List::Seperator"
}
ENDGRAMMAR
my $tests=<<'ENDTEST';
({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
1
"This is \"quoted\" dude"
({})
({({}),({})})
({"wow",({1}),"bob",({1}),"cool"})
{{"cool",1,subhash=>{{b=>"c",1=>2}},a=>({1,2,3,{{}},}),}}
({what=>a=>mess=>this=>all=>"is"})
({what=>=>=>is=>=>=>the=>=>=>point=>=>=>of=>=>=>it=>=>=>"all?",,,})
ENDTEST
foreach my $test (split /\n/,$tests) {
print "------------------\n\n";
my $value=$parser->expr($test);
print "'$test'\n";
if ( defined $value ) {
print "\nproduced the following structure:\n\n";
print Data::Dumper->new([$value])
->Terse(1)
->Indent(1)
->Dump(),"\n";
} else {
print "Failed to parse!\n";
}
}
Which outputs (single quotes are produced by Dumper)
------------------
''
Failed to parse!
------------------
'({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })'
produced the following structure:
[
'1',
'2',
'three',
'0',
[
'internal',
'array',
'0'
],
'end'
]
------------------
'1'
produced the following structure:
'1'
------------------
'"This is \"quoted\" dude"'
produced the following structure:
'This is "quoted" dude'
------------------
'({})'
produced the following structure:
[]
------------------
'({({}),({})})'
produced the following structure:
[
[],
[]
]
------------------
'({"wow",({1}),"bob",({1}),"cool"})'
produced the following structure:
[
'wow',
[
'1'
],
'bob',
[
'1'
],
'cool'
]
------------------
'{{"cool",1,subhash=>{{b=>"c",1=>2}},a=>({1,2,3,{{}},}),}}'
produced the following structure:
{
'cool' => '1',
'a' => [
'1',
'2',
'3',
{}
],
'subhash' => {
'1' => '2',
'b' => 'c'
}
}
------------------
'({what=>a=>mess=>this=>all=>"is"})'
produced the following structure:
[
'what',
'a',
'mess',
'this',
'all',
'is'
]
------------------
'({what=>=>=>is=>=>=>the=>=>=>point=>=>=>of=>=>=>it=>=>=>"all?",,,})'
produced the following structure:
[
'what',
'is',
'the',
'point',
'of',
'it',
'all?'
]
Incidentally the secret to easily parsing stuff like this using P::RD is to learn how to use <leftop:> properly.
Personally since I learned of and about Parse::RecDescent I would never write a parser in perl without it. (Well, maybe with another Parse:: module instead...)
HTH, and thanks for the post. I found this a fun use of P::RD. To be honest I rewrote it a couple of times as I played with different ways to do this (and extend it). Even still assuming you grok <leftop:> I think a beginner with P::RD could pull off your original requirements in pretty good time. I say this only to reduce the chance that you get intimdated by P::RD's rather long (for good reason) documentation. Its actually a lot easier than it seems at first glance most times.
NOTE TO THE EXPERIENCED P::RD USER How do you return undef from an action without causing the rule to fail? I dont see any way to do it, which means to me that some kind of unwrapping potentially becomes necessary after the parse. Any ideas?
Cheers,
--- demerphq
my friends call me, usually because I'm late....
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.