comment on

Any time you think to yourself "thats too complicated for a regex, and its too complicated for Text::Balanced so I guess I need to write a parser" you should reach for the handy dandy TheDamian approved Parse::RecDescent.

Heres your problem solved using that module. And with an extra thrown in, the grammer supports a {{ hash_key=>({"value"}) }} syntax as well just to show how easy it is. (If you dont like the hash functionality you can remove it easily. :-)

Incidentally it seems like you are using perl like syntax so that ({,,,,1,,,,}) parses in the same way that [,,,,1,,,,] would parse in perl. I took this a step further and gave it the same behaviour as perl, at least as far as its handling of barewords and the phat comma => go. This slightly complicated matters but not too much.

Anyway, what you haven't really said (or I didnt notice :-) is what happens if you try to parse "". This code will fail (return undef) in that case.

UPDATE: Sigh. Like a muppet I didnt pay attention to the fact that you _have_ provided a hash notation specification. Well, heres a version that handles that. Ive kept my original perl version though. Sorry about that. Anyway, heres a version that parsers AFAICT what you want. Notice it isnt all that different from the other version. (Ive readmore'd my original stuff)

use strict;
use warnings;

use Parse::RecDescent;
use Data::Dumper;

$::RD_WARN  = 0 || undef;       # unless undefined, also report non-fa
+tal problems
$::RD_HINT  = 0 || undef;       # if defined, also suggestion remedies
$::RD_TRACE = 0 || undef;       # if defined trace the parse

$::RD_AUTOACTION ='$item[1]';
my $parser = Parse::RecDescent->new(<<'ENDGRAMMAR');
# BEGIN GRAMMAR

# we want to parse things like...
# ({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })

# Base Rule (start-rule)
expr : value /\z/

# Recursion point
value : array
      | hash
      | string
      | number

# list of items. returns an arrayref
# we grep out the undefs so  ",,1,," is treated as one element and not
# three or five or any other number :-)
array : "({" <leftop: val_or_empty ',' val_or_empty > "})"
      {
          [ grep !UNIVERSAL::isa($_,"Value::Empty"),@{$item[2]} ]
      }

hash  : "([" <leftop: keyvalue ',' keyvalue > "])"
       {
             $return={};
             !UNIVERSAL::isa($_,"Value::Empty") and
                ($return->{$_->[0]}=$_->[1]) foreach @{$item[2]};
       }

keyvalue : string_or_number ":" value
         {
             [ @item[1,3] ]
         }
         | empty


string_or_number : string
                 | number

val_or_empty : value
             | empty

empty : ""
      {
        bless \do{my $x},"Value::Empty"
      }

# quoted escaped string. escaping reversed and quotes removed.
string : /"((?:[^"\\]+|\\"|\\)+)"/
       {
            my $ret=$1;
            $ret=~s/\\"/"/g;
            $ret;
       }

# number. could be a better regex
number : /\d+/

ENDGRAMMAR


my $tests=<<'ENDTEST';

({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
1
"This is \"quoted\" dude"
({})
({({}),({})})
({"wow",({1}),"bob",({1}),"cool"})
(["cool":1,"subhash":(["b":"c",1:2]),"a":({1,2,3,([]),}),])
ENDTEST

foreach my $test (split /\n/,$tests) {
    print "------------------\n\n";
    my $value=$parser->expr($test);
    print "'$test'\n";
    if ( defined $value ) {

        print "\nproduced the following structure:\n\n";
        print Data::Dumper->new([$value])
                          ->Terse(1)
                          ->Indent(1)
                          ->Dump(),"\n";
    } else {
        print "Failed to parse!\n";
    }
}
[download]

Heres the version that parses much more like perl. Its what I produced not having noticed the hash notation spec you provided. Sigh. END UPDATE

use strict;
use warnings;

use Parse::RecDescent;
use Data::Dumper;

$::RD_WARN  = 0 || undef;       # if defined, report non-fatal probs
$::RD_HINT  = 0 || undef;       # if defined, give suggestion remedies
$::RD_TRACE = 0 || undef;       # if defined trace the parse

$::RD_AUTOACTION ='$item[1]';
my $parser = Parse::RecDescent->new(<<'ENDGRAMMAR');
# BEGIN GRAMMAR

# we want to parse things like...
# ({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })

# Base Rule (start-rule)
expr : value /\z/

# Recursion point
value : array
      | hash
      | string
      | number
      | bareword

# list of items. returns an arrayref
# we grep out the undefs so  ",,1,," is treated as one element and not
# three or five or any other number :-)
list  : <leftop: val_or_empty comma val_or_empty >
      {
         my @items=grep !UNIVERSAL::isa($_,"Value::Empty"),@{$item[1]}
+;
         $return=[];
         while (@items) {
             if (UNIVERSAL::isa($items[0],"Value::Bareword")) {
                 if (!UNIVERSAL::isa($items[1],"List::Seperator")
                     or ${$items[1]} ne "=>") {
                     warn "Dont know what to do with bareword '$items[
+1]'\n";
                     undef $return;
                     last;
                 } else {
                     $items[0]=${$items[0]};
                 }
             }
             push @$return,shift @items;
             shift @items while UNIVERSAL::isa($items[0],"List::Sepera
+tor");
         }
         $return
      }

array : "({" list "})"
      {
          $item[2]
      }

hash  : "{{" list "}}"
       {
             my $items=$item[2];
             if (@$items % 2) {
                 push @$items,undef;
                 warn "Uneven list in hash creation\n";
             }
             $return = { @$items };
       }


val_or_empty : value
             | ""
             {
                  bless \do{my $x},"Value::Empty"
             }

# quoted escaped string. escaping reversed and quotes removed.
string : /"((?:[^"\\]+|\\"|\\)+)"/
       {
            my $ret=$1;
            $ret=~s/\\"/"/g;
            $ret;
       }

# number. could be a better regex
number : /\d+/

# a bareword. well figurout if its legal later
bareword : /\w+/
         {
             bless \do{ my $x=$item[1] },"Value::Bareword"
         }

comma : /,|=>/
         {
             bless \do{ my $x=$item[1] },"List::Seperator"
         }
ENDGRAMMAR


my $tests=<<'ENDTEST';

({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })
1
"This is \"quoted\" dude"
({})
({({}),({})})
({"wow",({1}),"bob",({1}),"cool"})
{{"cool",1,subhash=>{{b=>"c",1=>2}},a=>({1,2,3,{{}},}),}}
({what=>a=>mess=>this=>all=>"is"})
({what=>=>=>is=>=>=>the=>=>=>point=>=>=>of=>=>=>it=>=>=>"all?",,,})
ENDTEST

foreach my $test (split /\n/,$tests) {
    print "------------------\n\n";
    my $value=$parser->expr($test);
    print "'$test'\n";
    if ( defined $value ) {

        print "\nproduced the following structure:\n\n";
        print Data::Dumper->new([$value])
                          ->Terse(1)
                          ->Indent(1)
                          ->Dump(),"\n";
    } else {
        print "Failed to parse!\n";
    }
}
[download]

Which outputs (single quotes are produced by Dumper)

------------------

''
Failed to parse!
------------------

'({ 1, 2, "three", 0, ({ "internal", "array", 0, }), "end", })'

produced the following structure:

[
  '1',
  '2',
  'three',
  '0',
  [
    'internal',
    'array',
    '0'
  ],
  'end'
]

------------------

'1'

produced the following structure:

'1'

------------------

'"This is \"quoted\" dude"'

produced the following structure:

'This is "quoted" dude'

------------------

'({})'

produced the following structure:

[]

------------------

'({({}),({})})'

produced the following structure:

[
  [],
  []
]

------------------

'({"wow",({1}),"bob",({1}),"cool"})'

produced the following structure:

[
  'wow',
  [
    '1'
  ],
  'bob',
  [
    '1'
  ],
  'cool'
]

------------------

'{{"cool",1,subhash=>{{b=>"c",1=>2}},a=>({1,2,3,{{}},}),}}'

produced the following structure:

{
  'cool' => '1',
  'a' => [
    '1',
    '2',
    '3',
    {}
  ],
  'subhash' => {
    '1' => '2',
    'b' => 'c'
  }
}

------------------

'({what=>a=>mess=>this=>all=>"is"})'

produced the following structure:

[
  'what',
  'a',
  'mess',
  'this',
  'all',
  'is'
]

------------------

'({what=>=>=>is=>=>=>the=>=>=>point=>=>=>of=>=>=>it=>=>=>"all?",,,})'

produced the following structure:

[
  'what',
  'is',
  'the',
  'point',
  'of',
  'it',
  'all?'
]
[download]

Incidentally the secret to easily parsing stuff like this using P::RD is to learn how to use <leftop:> properly.

Personally since I learned of and about Parse::RecDescent I would never write a parser in perl without it. (Well, maybe with another Parse:: module instead...)

HTH, and thanks for the post. I found this a fun use of P::RD. To be honest I rewrote it a couple of times as I played with different ways to do this (and extend it). Even still assuming you grok <leftop:> I think a beginner with P::RD could pull off your original requirements in pretty good time. I say this only to reduce the chance that you get intimdated by P::RD's rather long (for good reason) documentation. Its actually a lot easier than it seems at first glance most times.

NOTE TO THE EXPERIENCED P::RD USER How do you return undef from an action without causing the rule to fail? I dont see any way to do it, which means to me that some kind of unwrapping potentially becomes necessary after the parse. Any ideas?

Cheers,

--- demerphq
my friends call me, usually because I'm late....

In reply to Re: Parsing Text into Arrays.. by demerphq
in thread Parsing Text into Arrays.. by castaway

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


good chemistry is complicated, and a little bit messy -LW
	PerlMonks