Re: Complex Splitting - Parse::RecDescent

in reply to Complex Splitting

If the string being split will always be well-formed, then I would go with one of the regex solutions provided above. If there is a possibility that the data will be malformed, then you may be better off with a parser approach, as it allows more flexibility in error handling.
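To make the trade-off concrete, here is a minimal sketch of what a plain regex approach could look like. It is not necessarily one of the solutions posted earlier in the thread, and the helper name split_well_formed is made up for illustration. Note how it silently skips anything it does not recognise (the "1" in the sample string), which is fine for well-formed input but hides problems in malformed input.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): extract tokens from a
# string that is assumed to be well-formed.
sub split_well_formed {
    my ($str) = @_;
    my @tokens;

    # Either a bracketed run of capitals (captured in $1, possibly empty)
    # or a single capital letter (captured in $2).
    while ($str =~ /\[([A-Z]*)\]|([A-Z])/g) {
        push @tokens, defined $1 ? $1 : $2;
    }
    return @tokens;
}

# The "1" is not matched by either alternative, so it is dropped
# without any warning.
print join(' ', split_well_formed("ABC[GHI]XY[Z]1A")), "\n";
# prints "A B C GHI X Y Z A"
```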

Here is a solution using Parse::RecDescent.

use Parse::RecDescent;
use strict;
use warnings;

my $str = "ABC[GHI]XY[Z]1A";
my $grammar = <<'GRAMMAR';
  token : '[' /[A-Z]*/ ']' {$return = $item[2]}
           | /[A-Z]/
  anything : /./
GRAMMAR

my $parser = Parse::RecDescent->new($grammar);

# When a reference to a scalar is passed to Parse::RecDescent it will
# consume the tokens as they are matched. To avoid modifying the original
# string a copy will be used
my $copy = $str;
while ($copy ne '') {
    # Test with defined(): an empty pair of brackets yields '', which is false
    if (defined(my $token = $parser->token(\$copy))) {
        print "Token: $token\n";
    }
    else {
        my $token = $parser->anything(\$copy);
        print "Invalid symbol: $token\n";
    }
}

Replies are listed 'Best First'.

Re^2: Complex Splitting - /\G.../gc
by ikegami (Patriarch) on Feb 07, 2007 at 08:00 UTC

That silently ignores whitespace (read up on <skip>).

Also, P::RD is rather slow. I'd even say inexcusably slow if you're just using it as a tokenizer. May I suggest a much faster tokenizer?

use strict;
use warnings;

sub process_token {
   my ($token) = @_;
   print("Token: $token\n");
}

{
   my $str = "ABC[GHI]XY[Z]1A";

   for ($str) {
      /\G \[ ([A-Z]*) \] /xgcs && do {
         process_token("$1");
         redo
      };

      /\G ([A-Z]) /xgcs && do {
         process_token("$1"); 
         redo
      };

      /\G (.) /xgcs && do {
         printf("Unexpected '%s' at pos %d\n", $1, pos()-length($1));
         redo
      };
   }
}
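If you would rather collect the results than print them as you go, the same /\G.../gc technique can be wrapped in a function. This is only a sketch: the name tokenize and the shape of its return value are assumptions made for illustration, not part of the code above. Unlike the Parse::RecDescent version with its default skip pattern, whitespace is reported as an unexpected character here rather than skipped.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post above): returns two array refs,
# the tokens found and [character, position] pairs for unexpected input.
sub tokenize {
    my ($str) = @_;
    my (@tokens, @errors);

    pos($str) = 0;
    while (pos($str) < length $str) {
        # /c keeps pos() where it was when a match fails, so the next
        # alternative is tried from the same position.
        if    ($str =~ /\G \[ ([A-Z]*) \] /xgc) { push @tokens, $1 }
        elsif ($str =~ /\G ([A-Z]) /xgc)        { push @tokens, $1 }
        elsif ($str =~ /\G (.) /xgcs)           { push @errors, [ $1, pos($str) - 1 ] }
    }
    return (\@tokens, \@errors);
}

my ($tokens, $errors) = tokenize("AB [C]");
print "Tokens: @$tokens\n";                               # Tokens: A B C
printf "Unexpected '%s' at pos %d\n", @$_ for @$errors;   # Unexpected ' ' at pos 2
```

Because the last alternative uses /s, it matches any single character, so every pass through the loop advances pos() and the loop always terminates.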
