split on commas

pip9ball has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: split on commas (Parse::RecDescent) by ikegami (Patriarch) on Jun 05, 2009 at 19:12 UTC
Considering you're constantly adding more requirements, it's my opinion that it's time to move to a full-fledged parser.

```perl
#!/usr/bin/perl
# make_parser.pl

use strict;
use warnings;

use Parse::RecDescent qw( );

my $grammar = <<'__END_OF_GRAMMAR__';

   {
      use strict;
      use warnings;
   }

   parse : <skip:''> item_list /\Z/ { $item[2] }

   item_list : <leftop: item ',' item> { $item[1] }

   item : prefix bodies { [ $item[1], $item[2] ] }
        | bodies { [ undef, $item[1] ] }

   bodies : group { $item[1] }
          | body { [ $item[1] ] }

   prefix : '<*' INTEGER '>' { $item[2] }

   group : '(' <leftop: body ',' body> ')' { $item[2] }

   body : NAME suffix(?) { [ $item[1], @{ $item[2] } ] }

   suffix : '<' suffix_list '>' { $item[2] }

   suffix_list : INTEGER suffix_list_[ $item[1] ]

   suffix_list_ : ',' <leftop: INTEGER ',' INTEGER>
                  { [ ',' => $arg[0], @{ $item[2] } ] }
                | ':' <leftop: INTEGER ':' INTEGER>
                  { [ ':' => $arg[0], @{ $item[2] } ] }
                | { [ '' => $arg[0] ] }

   INTEGER : /0|[1-9][0-9]*/

   NAME : /[A-Za-z][A-Za-z0-9_]*/

__END_OF_GRAMMAR__

Parse::RecDescent->Precompile($grammar, 'Grammar')
   or die("Bad grammar\n");
```

```perl
#!/usr/bin/perl
# test.pl

use strict;
use warnings;

use Data::Dumper qw( Dumper );
use Grammar      qw( );

#$::RD_TRACE = '';

# ----------vvv Example of what you can do vvv----------

sub deparse_suffix {
   my ($suffix) = @_;
   return '' if !defined($suffix);
   my ($sep, @rest) = @$suffix;
   return '<' . join($sep, @rest) . '>';
}

sub deparse_body {
   my ($body) = @_;
   my ($name, $suffix) = @$body;
   return $name . deparse_suffix($suffix);
}

sub deparse_bodies {
   my ($bodies) = @_;
   my $deparsed = join ',', map deparse_body($_), @$bodies;
   return "($deparsed)" if @$bodies > 1;
   return $deparsed;
}

sub deparse_item {
   my ($item) = @_;
   my ($prefix, $bodies) = @$item;
   my $deparsed = '';
   $deparsed .= "<*$prefix>" if defined($prefix);
   $deparsed .= deparse_bodies($bodies);
   return $deparsed;
}

sub deparse_items {
   my ($items) = @_;
   return join ',', map deparse_item($_), @$items;
}

# ----------^^^ Example of what you can do ^^^----------

my $parser = Grammar->new();
while (<DATA>) {
   chomp;
   my $items = $parser->parse($_)
      or do {
         warn("Bad data at line $.\n");
         next;
      };

   print("in: $_\n");
   #print Dumper $items;
   print("out: ", deparse_items($items), "\n");
   print("\n");
}

__DATA__
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
```

```
$ perl make_parser.pl && perl test.pl
in: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
out: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
```

You didn't specify what data you need from the line, so the parser returns everything. It could be simplified if your requirements are more specific.
Re^2: split on commas by pip9ball (Acolyte) on Jun 05, 2009 at 22:08 UTC
Wow, thanks for your in-depth reply! I can see how this can be very powerful, but I'm afraid I don't understand the Parse::RecDescent module well enough to expand on the grammar. I'll see if I can read up on the module :-) Thanks again!
Re^3: split on commas (Text::Balanced) by ikegami (Patriarch) on Jun 09, 2009 at 18:17 UTC
You still didn't specify what data you need from the line. If you really just want to split on the commas, you can use Text::Balanced.

```perl
use strict;
use warnings;

use Text::Balanced qw( extract_bracketed extract_multiple );

while (<DATA>) {
   chomp;

   print("\n") if $. != 1;

   my @extracted = extract_multiple($_, [
      ',',
      \&extract_bracketed,
   ]);

   my @pieces;
   push @pieces, '' if @extracted;
   for (@extracted) {
      if ($_ eq ',') {
         push @pieces, '';
      } else {
         $pieces[-1] .= $_;
      }
   }

   print("$_\n") for @pieces;
}

__DATA__
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
```
Re^4: split on commas (Text::Balanced) by pip9ball (Acolyte) on Jun 10, 2009 at 17:38 UTC
Re: split on commas by Anonymous Monk on Jun 05, 2009 at 18:07 UTC
Perhaps this?

```perl
$_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@line = split(/,(?![\w,]+[>)])/, $_);
print join("\n", @line), $/;
```

Which prints:

```
<*2>FOO<2,1>
<*3>(SigB<8:0:2>,BAR)
<*2>Siga<2:0>
Sigb<8,7,6,5,0>
```
Re^2: split on commas by pip9ball (Acolyte) on Jun 05, 2009 at 18:10 UTC
Yes, this is what I need. Perhaps you can explain your split line? Thanks!
Re^3: split on commas by toolic (Bishop) on Jun 05, 2009 at 18:39 UTC
Using YAPE::Regex::Explain:

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use YAPE::Regex::Explain;

print YAPE::Regex::Explain->new(',(?![\w,]+[>)])')->explain();

__END__

The regular expression:

(?-imsx:,(?![\w,]+[>)]))

matches as follows:

NODE                     EXPLANATION
----------------------------------------------------------------------
(?-imsx:                 group, but do not capture (case-sensitive)
                         (with ^ and $ matching normally) (with . not
                         matching \n) (matching whitespace and #
                         normally):
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
  (?!                      look ahead to see if there is not:
----------------------------------------------------------------------
    [\w,]+                   any character of: word characters (a-z,
                             A-Z, 0-9, _), ',' (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    [>)]                     any character of: '>', ')'
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
)                        end of grouping
----------------------------------------------------------------------
```
Re^4: split on commas by pip9ball (Acolyte) on Jun 05, 2009 at 22:13 UTC
Re^3: split on commas by Anonymous Monk on Jun 05, 2009 at 18:37 UTC
With comments:

```perl
$_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@line = split(/
    ,             # comma...
    (?!           # not followed by
        [\w,]+    # "word" characters or commas and
        [>)]      # a close-bracket or close-paren
    )
/x, $_);
print join($/, @line), $/;
```
Re^4: split on commas by pip9ball (Acolyte) on Jun 05, 2009 at 22:12 UTC
Re^5: split on commas by tphyahoo (Vicar) on Jun 09, 2009 at 18:06 UTC
Re^2: split on commas by ig (Vicar) on Jun 06, 2009 at 11:22 UTC
This does not work for some strings.

```perl
use Data::Dumper;

$string = "(1,2,3,<4,5,6>),more";
my @line = split(/,(?![\w,]+[>)])/, $string);
print Dumper(\@line);
```

produces

```
$VAR1 = [
          '(1',
          '2',
          '3',
          '<4,5,6>)',
          'more'
        ];
```

The problem is any enclosed comma that is followed by an opening delimiter before a closing delimiter. I suspect it is not possible to generalize the solution without counting the delimiters.
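A minimal sketch of the delimiter-counting idea mentioned above, assuming only `<...>` and `(...)` act as delimiters and that they are always balanced. The helper name `split_top_level` is hypothetical and does not come from the thread; it walks the string one character at a time and only treats a comma as a separator when no delimiter is open.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split on commas at nesting depth 0.
sub split_top_level {
    my ($string) = @_;

    my %closer_for = ('<' => '>', '(' => ')');
    my %is_closer  = map { $_ => 1 } values %closer_for;

    my @stack;            # closing delimiters we are still waiting for
    my @pieces = ('');

    for my $ch (split //, $string) {
        if ($ch eq ',' && !@stack) {
            push @pieces, '';          # top-level comma: start a new piece
            next;
        }
        if (exists $closer_for{$ch}) {
            push @stack, $closer_for{$ch};
        }
        elsif ($is_closer{$ch}) {
            die "Unbalanced '$ch' in '$string'\n"
                if !@stack || $stack[-1] ne $ch;
            pop @stack;
        }
        $pieces[-1] .= $ch;
    }

    die "Unclosed delimiter in '$string'\n" if @stack;
    return @pieces;
}

print "$_\n" for split_top_level('(1,2,3,<4,5,6>),more');
```

For the string that defeats the look-ahead regex, this keeps `(1,2,3,<4,5,6>)` as one piece and `more` as the other. Dying on unbalanced input is a design choice made for the sketch; returning the partial pieces would be equally defensible.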
Re: split on commas by Herkum (Parson) on Jun 05, 2009 at 17:44 UTC
Have you considered Text::CSV or Parse::CSV?
Re^2: split on commas by pip9ball (Acolyte) on Jun 05, 2009 at 18:04 UTC
I took a look at the modules you recommended. However, I don't think these will help, as the commas inside the '<>' and '()' characters will still be parsed, since they are not contained within quotes.
Re: split on commas by hma (Pilgrim) on Jun 05, 2009 at 18:55 UTC
I successfully applied Ovid's module Data::Record - "split on steroids" - to a similar problem.
Re: split on commas by ig (Vicar) on Jun 06, 2009 at 11:34 UTC
Another option is:

```perl
use strict;
use warnings;
use Data::Dumper;
use Regexp::Common qw/balanced/;

my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";

print "$string\n";
my @parts = $string =~ m/(?:$RE{balanced}{-parens=>'()<>'}|[^,<(]+)+/g;
print Dumper(\@parts);
```

which produces

```
<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
$VAR1 = [
          '<*2>FOO<2,1>',
          '<*3>(SigB<8:0:2>,BAR)',
          '<*2>Siga<2:0>',
          'Sigb<8,7,6,5,0>'
        ];
```
Re: split on commas by Anonymous Monk on Jun 05, 2009 at 18:32 UTC
How about `@fields = split /[>)],[<(]/ , $line;`
Re: split on commas by JavaFan (Canon) on Jun 07, 2009 at 00:36 UTC
I wouldn't use split. It's much easier to define what you want to keep:

```perl
use feature 'say';

$string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
@parts = $string =~ /((?:[^<(,]+|<[^>]*>|\([^)]+\))+)/g;
say for @parts;
__END__
<*2>FOO<2,1>
<*3>(SigB<8:0:2>,BAR)
<*2>Siga<2:0>
Sigb<8,7,6,5,0>
```

This assumes no nesting of parentheses or angle brackets.
Re: split on commas by Anonymous Monk on Jun 07, 2009 at 15:39 UTC
First of all, look for a start marker, '<' or '('. Leave every character after it as it is until you reach the respective end marker: '>' for '<' and ')' for '('. Everywhere else, replace the commas with '|||' and split on that. Maybe this helps. Abhishek. abhi12524@yahoo.com
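One way to read that suggestion, as a rough sketch: a substitution copies every `<...>` or `(...)` run through unchanged and turns each remaining comma into the `|||` marker, after which a plain split does the rest. The sample line is made up in the style of the thread's data, and the sketch assumes that `|||` never occurs in the input and that brackets are not nested inside brackets of the same kind.

```perl
use strict;
use warnings;

# Made-up sample in the style of the thread's data.
my $line = 'FOO<2,1>,(SigB<8:0:2>,BAR),Sigb<8,7,6,5,0>';

# Copy whole (...) and <...> runs as they are; any comma that is
# reached outside such a run becomes the '|||' marker.
(my $marked = $line) =~ s{ ( \( [^)]* \) | < [^>]* > ) | , }
                         { defined $1 ? $1 : '|||' }gex;

my @fields = split /\|\|\|/, $marked;
print "$_\n" for @fields;
```

This prints `FOO<2,1>`, `(SigB<8:0:2>,BAR)` and `Sigb<8,7,6,5,0>` on separate lines. The intermediate marker is not strictly needed, since the same pass could collect the fields directly, but it keeps the sketch close to the wording of the post.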
Re: split on commas by tphyahoo (Vicar) on Jun 08, 2009 at 21:11 UTC
Not a beginner answer, but the "principled" way to solve this is to use a parser, not a regex. Parse::RecDescent is such a parser. However, as the other answers have said, there are ways to "cheat" around this without using a full-blown parser. The cheats may be brittle: for example, what happens if you have brackets inside a bracket, and commas inside that? However, they'll work for most purposes.
Re: split on commas by John M. Dlugosz (Monsignor) on Jun 10, 2009 at 17:29 UTC
I recall that Perl 5.10 added some regex features, including full recursive sub-regexes, and that should allow "parsing" to a greater degree than we had before. The example in perldelta matches nested angle brackets. I recall a more detailed overview of the new Perl 5.10 features somewhere that showed this and also external regexes included by reference.
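As a rough illustration of what recursive sub-patterns make possible here (this is not the perldelta example), the sketch below defines a named group that matches a balanced `<...>` or `(...)` run, calling itself for nested runs, and then collects every maximal stretch of non-comma text and balanced groups. The names `$field` and `top_level_fields` are invented for this sketch, and it assumes Perl 5.10 or later.

```perl
use 5.010;
use strict;
use warnings;

# A field is a run of "plain" characters and balanced groups.
# (?&group) recurses into the named pattern, so nesting is handled.
my $field = qr{
    (?(DEFINE)
        (?<group>
            <  (?: [^<>()]++ | (?&group) )*+  >
          | \( (?: [^<>()]++ | (?&group) )*+ \)
        )
    )
    (?: [^,<>()]++ | (?&group) )++
}x;

# Hypothetical helper: return the comma-separated top-level fields.
sub top_level_fields {
    my ($string) = @_;
    my @fields;
    while ($string =~ /$field/g) {
        # @- and @+ hold the start and end offsets of the whole match.
        push @fields, substr($string, $-[0], $+[0] - $-[0]);
    }
    return @fields;
}

say for top_level_fields('(1,2,3,<4,5,6>),more');
```

The fields are taken from the match offsets rather than from the list returned by `m//g`, because the named group inside the pattern is itself a capture group and would otherwise show up in that list. Unbalanced delimiters are silently skipped rather than reported, which may or may not be what is wanted.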

