pip9ball has asked for the wisdom of the Perl Monks concerning the following question:
Hi,
I have the following string:
my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
I need a way to split on commas, but not the commas inside
'<>' or '()' characters
What I've been trying to do is first replace all commas inside the '<>' and '()' with '|||', then do the split on commas, followed by another replacement of '|||' back to commas.
Is there a better way to do this?
Thanks!
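For reference, the three-step approach described above might look roughly like the sketch below. The helper name hide_commas is made up for illustration, and the sketch assumes that '|||' never occurs in the data and that a bracket type is never nested inside itself.

```perl
use strict;
use warnings;

my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";

# Hypothetical helper: hide the commas of one bracketed group behind '|||'.
sub hide_commas {
    my ($group) = @_;
    $group =~ s/,/|||/g;
    return $group;
}

# Step 1: protect the commas inside (...) and <...>.
(my $protected = $string) =~ s/(\([^)]*\)|<[^>]*>)/hide_commas($1)/ge;

# Step 2: split on the remaining (top-level) commas.
# Step 3: turn each placeholder back into a comma.
my @fields = map { my $f = $_; $f =~ s/\Q|||\E/,/g; $f } split /,/, $protected;

print "$_\n" for @fields;
```

It works for the sample string, but it walks over the string three times, which is why the replies below look for a single-step solution.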
Re: split on commas (Parse::RecDescent)
by ikegami (Patriarch) on Jun 05, 2009 at 19:12 UTC
|
Considering you're constantly adding more requirements, it's my opinion that it's time to move to a full-fledged parser.
#!/usr/bin/perl
# make_parser.pl
use strict;
use warnings;
use Parse::RecDescent qw( );
my $grammar = <<'__END_OF_GRAMMAR__';
{
use strict;
use warnings;
}
parse : <skip:''> item_list /\Z/ { $item[2] }
item_list : <leftop: item ',' item> { $item[1] }
item : prefix bodies { [ $item[1], $item[2] ] }
| bodies { [ undef, $item[1] ] }
bodies : group { $item[1] }
| body { [ $item[1] ] }
prefix : '<*' INTEGER '>' { $item[2] }
group : '(' <leftop: body ',' body> ')' { $item[2] }
body : NAME suffix(?) { [ $item[1], @{ $item[2] } ] }
suffix : '<' suffix_list '>' { $item[2] }
suffix_list : INTEGER suffix_list_[ $item[1] ]
suffix_list_ : ',' <leftop: INTEGER ',' INTEGER>
{ [ ',' => $arg[0], @{ $item[2] } ] }
| ':' <leftop: INTEGER ':' INTEGER>
{ [ ':' => $arg[0], @{ $item[2] } ] }
| { [ '' => $arg[0] ] }
INTEGER : /0|[1-9][0-9]*/
NAME : /[A-Za-z][A-Za-z0-9_]*/
__END_OF_GRAMMAR__
Parse::RecDescent->Precompile($grammar, 'Grammar')
or die("Bad grammar\n");
#!/usr/bin/perl
# test.pl
use strict;
use warnings;
use Data::Dumper qw( Dumper );
use Grammar qw( );
#$::RD_TRACE = '';
# ----------vvv Example of what you can do vvv----------
sub deparse_suffix {
my ($suffix) = @_;
return '' if !defined($suffix);
my ($sep, @rest) = @$suffix;
return '<' . join($sep, @rest) . '>';
}
sub deparse_body {
my ($body) = @_;
my ($name, $suffix) = @$body;
return $name . deparse_suffix($suffix);
}
sub deparse_bodies {
my ($bodies) = @_;
my $deparsed = join ',', map deparse_body($_), @$bodies;
return "($deparsed)" if @$bodies > 1;
return $deparsed;
}
sub deparse_item {
my ($item) = @_;
my ($prefix, $bodies) = @$item;
my $deparsed = '';
$deparsed .= "<*$prefix>" if defined($prefix);
$deparsed .= deparse_bodies($bodies);
return $deparsed;
}
sub deparse_items {
my ($items) = @_;
return join ',', map deparse_item($_), @$items;
}
# ----------^^^ Example of what you can do ^^^----------
my $parser = Grammar->new();
while (<DATA>) {
chomp;
my $items = $parser->parse($_)
or do { warn("Bad data at line $.\n");
next;
};
print("in: $_\n");
#print Dumper $items;
print("out: ", deparse_items($items), "\n");
print("\n");
}
__DATA__
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
$ perl make_parser.pl && perl test.pl
in: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
out: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
You didn't specify what data you need from the line, so the parser returns everything. It could be simplified if your requirements are more specific.
|
Wow, thanks for your in-depth reply! I can see how this can be very powerful, but I'm afraid I don't understand
the Parse::RecDescent module well enough to expand on the grammar.
I'll see if I can read up on the module :-)
Thanks again!
|
You still didn't specify what data you need from the line. If you really just want to split on the commas, you can use Text::Balanced.
use strict;
use warnings;
use Text::Balanced qw( extract_bracketed extract_multiple );
while (<DATA>) {
chomp;
print("\n") if $. != 1;
my @extracted = extract_multiple($_, [
',',
\&extract_bracketed,
]);
my @pieces;
push @pieces, '' if @extracted;
for (@extracted) {
if ($_ eq ',') {
push @pieces, '';
} else {
$pieces[-1] .= $_;
}
}
print("$_\n") for @pieces;
}
__DATA__
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
|
Re: split on commas
by Anonymous Monk on Jun 05, 2009 at 18:07 UTC
|
$_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@line = split(/,(?![\w,]+[>)])/, $_);
print join("\n", @line), $/;
Which prints:
<*2>FOO<2,1>
<*3>(SigB<8:0:2>,BAR)
<*2>Siga<2:0>
Sigb<8,7,6,5,0>
|
Yes, this is what I need. Perhaps you can explain your split line?
Thanks!
|
#!/usr/bin/env perl
use strict;
use warnings;
use YAPE::Regex::Explain;
print YAPE::Regex::Explain->new(',(?![\w,]+[>)])')->explain();
__END__
The regular expression:
(?-imsx:,(?![\w,]+[>)]))
matches as follows:
NODE EXPLANATION
----------------------------------------------------------------------
(?-imsx: group, but do not capture (case-sensitive)
(with ^ and $ matching normally) (with . not
matching \n) (matching whitespace and #
normally):
----------------------------------------------------------------------
, ','
----------------------------------------------------------------------
(?! look ahead to see if there is not:
----------------------------------------------------------------------
[\w,]+ any character of: word characters (a-z,
A-Z, 0-9, _), ',' (1 or more times
(matching the most amount possible))
----------------------------------------------------------------------
[>)] any character of: '>', ')'
----------------------------------------------------------------------
) end of look-ahead
----------------------------------------------------------------------
) end of grouping
----------------------------------------------------------------------
|
$_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@line = split(/ , # comma...
(?! # not followed by
[\w,]+ # "word" characters or commas and
[>)] # a close-bracket or close-paren
)
/x, $_);
print join($/, @line), $/;
|
use Data::Dumper;
my $string = "(1,2,3,<4,5,6>),more";
my @line = split(/,(?![\w,]+[>)])/, $string);
print Dumper(\@line);
produces
$VAR1 = [
'(1',
'2',
'3',
'<4,5,6>)',
'more'
];
The problem is any protected comma that is followed by an opening delimiter before its closing delimiter. I suspect it is not possible to generalize the solution without counting the delimiters.
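Counting the delimiters could be done with a small scanner along these lines. This is only a sketch: the function name split_top_level is made up, and unbalanced input is passed through without any error being reported.

```perl
use strict;
use warnings;

# Split $text on commas that sit outside every <...> and (...) group.
# Nesting depth is tracked with a stack of the closers still expected.
sub split_top_level {
    my ($text) = @_;
    my %closer_for = ('<' => '>', '(' => ')');
    my @expected;            # closing delimiters we are still waiting for
    my @fields = ('');

    for my $char (split //, $text) {
        if (exists $closer_for{$char}) {
            push @expected, $closer_for{$char};
        }
        elsif (@expected && $char eq $expected[-1]) {
            pop @expected;
        }
        elsif ($char eq ',' && !@expected) {
            push @fields, '';    # top-level comma: start a new field
            next;
        }
        $fields[-1] .= $char;
    }
    return @fields;
}

print "$_\n" for split_top_level("(1,2,3,<4,5,6>),more");
```

For the string above this prints "(1,2,3,<4,5,6>)" and "more" on separate lines, because the commas inside the parentheses are never seen at depth zero.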
Re: split on commas
by Herkum (Parson) on Jun 05, 2009 at 17:44 UTC
|
|
I took a look at the modules you recommended; however, I don't think these will help, as the commas inside '<>' and '()'
characters will still be parsed, since they are not contained within quotes.
Re: split on commas
by hma (Pilgrim) on Jun 05, 2009 at 18:55 UTC
|
I successfully applied Ovid's module Data::Record - "split on steroids" - to a similar problem.
Re: split on commas
by ig (Vicar) on Jun 06, 2009 at 11:34 UTC
|
use strict;
use warnings;
use Data::Dumper;
use Regexp::Common qw/balanced/;
my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
print "$string\n";
my @parts = $string =~ m/(?:$RE{balanced}{-parens=>'()<>'}|[^,<(]+)+/g;
print Dumper(\@parts);
which produces
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
$VAR1 = [
'<*2>FOO<2,1>',
'<*3>(SigB<8:0:2>,BAR)',
'<*2>Siga<2:0>',
'Sigb<8,7,6,5,0>'
];
Re: split on commas
by Anonymous Monk on Jun 05, 2009 at 18:32 UTC
|
How about
@fields = split /[>)],[<(]/ , $line;
Re: split on commas
by JavaFan (Canon) on Jun 07, 2009 at 00:36 UTC
|
I wouldn't use split. It's much easier to define what you want to keep:
use feature 'say';   # say needs Perl 5.10 or later
$string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@parts = $string =~ /((?:[^<(,]+|<[^>]*>|\([^)]+\))+)/g;
say for @parts;
__END__
<*2>FOO<2,1>
<*3>(SigB<8:0:2>,BAR)
<*2>Siga<2:0>
Sigb<8,7,6,5,0>
This assumes no nesting of parentheses or pointed brackets within themselves.
Re: split on commas
by Anonymous Monk on Jun 07, 2009 at 15:39 UTC
|
First, look for a start marker, '<' or '('.
Leave every character unchanged until you reach the
matching end marker: '>' for '<' and ')' for '('.
Everywhere else, replace the commas with '|||' and split on that.
Maybe this helps.
Abhishek.
abhi12524@yahoo.com
Re: split on commas
by tphyahoo (Vicar) on Jun 08, 2009 at 21:11 UTC
|
Not a beginner answer, but the "principled" way to solve this is to use a parser, not a regex.
Parse::RecDescent is such a parser.
However, as the other answers have said, there are ways to "cheat" around this without using a full-blown parser. The cheats may be brittle: for example, what happens if you have brackets inside a bracket, and commas inside that? However, they'll work for most purposes.
Re: split on commas
by John M. Dlugosz (Monsignor) on Jun 10, 2009 at 17:29 UTC
|
I recall that Perl 5.10 added some regex features, including full recursive sub-regexes, and that should allow "parsing" to a greater degree than we had before. The example in perldelta matches nested angle brackets. I recall a more detailed overview of the new Perl 5.10 features somewhere that showed this, and also external regexes included by reference.
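As a rough illustration of the recursive-pattern idea (this is not the perldelta example itself), the sketch below uses a named group and (?&name) recursion, both available from Perl 5.10, to pick out comma-separated fields while allowing arbitrarily nested '<>' and '()'. The group name "group" and the variable names are my own choices; empty fields and unbalanced brackets are not handled.

```perl
use strict;
use warnings;
use 5.010;

# A field is a run of plain characters and balanced groups.
# The DEFINE block only declares "group"; (?&group) calls it recursively.
my $field = qr{
    (?(DEFINE)
        (?<group>
            <  (?: [^<>()]++ | (?&group) )*  >
          | \( (?: [^<>()]++ | (?&group) )* \)
        )
    )
    (?: [^,<>()]++ | (?&group) )+
}x;

my $string = "(1,2,3,<4,5,6>),more";

# Collect $1 in a loop; a list-context //g would also return the
# (always undefined) named group.
my @parts;
push @parts, $1 while $string =~ /($field)/g;

say for @parts;
```

For this input @parts holds "(1,2,3,<4,5,6>)" and "more", the case that trips up the look-ahead split shown earlier in the thread.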