PerlMonks  

split on commas

by pip9ball (Acolyte)
on Jun 05, 2009 at 17:33 UTC
pip9ball has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have the following string:

my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";

I need a way to split on commas, but not on the commas inside
'<>' or '()' characters.

What I've been trying to do is first replace all commas
inside the '<>' and '()' with '|||', then do the split on commas,
followed by another replacement of '|||' back to commas.

Is there a better way to do this?

Thanks!
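
For reference, the placeholder approach described in the question might look roughly like the sketch below. The helper name split_with_placeholder is made up for illustration, and the sketch assumes that '|||' never occurs in the data and that nesting goes no deeper than '<>' inside '()'.

```perl
use strict;
use warnings;

# Hypothetical helper sketching the placeholder approach from the question.
# Assumptions: '|||' never appears in the data, and the deepest nesting is
# <...> inside (...).
sub split_with_placeholder {
    my ($string) = @_;
    my $placeholder = '|||';

    # Hide commas inside <...> first, then inside (...).
    for my $group (qr/<[^<>]*>/, qr/\([^()]*\)/) {
        $string =~ s{($group)}{
            my $inner = $1;
            $inner =~ s/,/$placeholder/g;
            $inner;
        }ge;
    }

    # Every comma that is left sits outside the delimiters.
    my @parts = split /,/, $string;

    # Put the hidden commas back.
    s/\Q$placeholder\E/,/g for @parts;
    return @parts;
}

my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
print "$_\n" for split_with_placeholder($string);
```

It works for the sample string, but it needs three passes over the data and breaks as soon as the placeholder shows up in real input.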

Re: split on commas
by Herkum (Parson) on Jun 05, 2009 at 17:44 UTC
      I took a look at the modules you recommended. However,
      I don't think these will help, as the commas inside '<>' and '()'
      characters will still be parsed because they are not contained within quotes.

Re: split on commas
by Anonymous Monk on Jun 05, 2009 at 18:07 UTC
    Perhaps this?
    $_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
    @line = split(/,(?![\w,]+[>)])/, $_);
    print join("\n", @line), $/;
    Which prints:
    <*2>FOO<2,1>
    <*3>(SigB<8:0:2>,BAR)
    <*2>Siga<2:0>
    Sigb<8,7,6,5,0>
      Yes, this is what I need. Perhaps you can explain your split line?

      Thanks!
        With comments:
        $_ = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
        @line = split(/
            ,           # comma...
            (?!         # not followed by
              [\w,]+    # "word" characters or commas and
              [>)]      # a close-bracket or close-paren
            )
        /x, $_);
        print join($/, @line), $/;
        Using YAPE::Regex::Explain:
        #!/usr/bin/env perl
        use strict;
        use warnings;
        use YAPE::Regex::Explain;
        print YAPE::Regex::Explain->new(',(?![\w,]+[>)])')->explain();
        __END__
        The regular expression:

        (?-imsx:,(?![\w,]+[>)]))

        matches as follows:

        NODE                     EXPLANATION
        ----------------------------------------------------------------------
        (?-imsx:                 group, but do not capture (case-sensitive)
                                 (with ^ and $ matching normally) (with . not
                                 matching \n) (matching whitespace and #
                                 normally):
        ----------------------------------------------------------------------
          ,                        ','
        ----------------------------------------------------------------------
          (?!                      look ahead to see if there is not:
        ----------------------------------------------------------------------
            [\w,]+                   any character of: word characters (a-z,
                                     A-Z, 0-9, _), ',' (1 or more times
                                     (matching the most amount possible))
        ----------------------------------------------------------------------
            [>)]                     any character of: '>', ')'
        ----------------------------------------------------------------------
          )                        end of look-ahead
        ----------------------------------------------------------------------
        )                        end of grouping
        ----------------------------------------------------------------------

      This does not work for some strings.

      use Data::Dumper;

      $string = "(1,2,3,<4,5,6>),more";
      my @line = split(/,(?![\w,]+[>)])/, $string);
      print Dumper(\@line);

      produces

      $VAR1 = [
                '(1',
                '2',
                '3',
                '<4,5,6>)',
                'more'
              ];

      The problem is any enclosed comma that is followed by an opening delimiter before a closing delimiter. I suspect it is not possible to generalize the solution without counting the delimiters.
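
A minimal sketch of what counting the delimiters could look like is shown below. The helper name split_by_depth is invented for this example, and the sketch assumes the input is well formed: it only tracks how deep the nesting is and does not check that each closer matches its opener.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas that sit at nesting depth zero.
# Assumption: delimiters are balanced; mismatched pairs are not detected.
sub split_by_depth {
    my ($string) = @_;
    my @parts = ('');
    my $depth = 0;

    for my $char (split //, $string) {
        if ($char eq '<' || $char eq '(') {
            $depth++;
        }
        elsif ($char eq '>' || $char eq ')') {
            $depth-- if $depth > 0;
        }

        if ($char eq ',' && $depth == 0) {
            push @parts, '';    # a top-level comma starts a new field
            next;
        }
        $parts[-1] .= $char;
    }
    return @parts;
}

print "$_\n" for split_by_depth("(1,2,3,<4,5,6>),more");
```

Unlike split, this returns one empty field for an empty input string, which may or may not be what a caller wants.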

Re: split on commas
by Anonymous Monk on Jun 05, 2009 at 18:32 UTC
    How about @fields = split /[>)],[<(]/ , $line;
Re: split on commas
by hma (Pilgrim) on Jun 05, 2009 at 18:55 UTC
    I successfully applied Ovid's module Data::Record - "split on steroids" - to a similar problem.
Re: split on commas (Parse::RecDescent)
by ikegami (Pope) on Jun 05, 2009 at 19:12 UTC

    Considering you're constantly adding more requirements, it's my opinion that it's time to move to a full-fledged parser.

    #!/usr/bin/perl
    # make_parser.pl

    use strict;
    use warnings;

    use Parse::RecDescent qw( );

    my $grammar = <<'__END_OF_GRAMMAR__';

       {
          use strict;
          use warnings;
       }

       parse : <skip:''> item_list /\Z/ { $item[2] }

       item_list : <leftop: item ',' item> { $item[1] }

       item : prefix bodies { [ $item[1], $item[2] ] }
            | bodies { [ undef, $item[1] ] }

       bodies : group { $item[1] }
              | body { [ $item[1] ] }

       prefix : '<*' INTEGER '>' { $item[2] }

       group : '(' <leftop: body ',' body> ')' { $item[2] }

       body : NAME suffix(?) { [ $item[1], @{ $item[2] } ] }

       suffix : '<' suffix_list '>' { $item[2] }

       suffix_list : INTEGER suffix_list_[ $item[1] ]

       suffix_list_ : ',' <leftop: INTEGER ',' INTEGER> { [ ',' => $arg[0], @{ $item[2] } ] }
                    | ':' <leftop: INTEGER ':' INTEGER> { [ ':' => $arg[0], @{ $item[2] } ] }
                    | { [ '' => $arg[0] ] }

       INTEGER : /0|[1-9][0-9]*/

       NAME : /[A-Za-z][A-Za-z0-9_]*/

    __END_OF_GRAMMAR__

    Parse::RecDescent->Precompile($grammar, 'Grammar')
       or die("Bad grammar\n");
    #!/usr/bin/perl
    # test.pl

    use strict;
    use warnings;

    use Data::Dumper qw( Dumper );
    use Grammar      qw( );

    #$::RD_TRACE = '';

    # ----------vvv Example of what you can do vvv----------

    sub deparse_suffix {
       my ($suffix) = @_;
       return '' if !defined($suffix);
       my ($sep, @rest) = @$suffix;
       return '<' . join($sep, @rest) . '>';
    }

    sub deparse_body {
       my ($body) = @_;
       my ($name, $suffix) = @$body;
       return $name . deparse_suffix($suffix);
    }

    sub deparse_bodies {
       my ($bodies) = @_;
       my $deparsed = join ',', map deparse_body($_), @$bodies;
       return "($deparsed)" if @$bodies > 1;
       return $deparsed;
    }

    sub deparse_item {
       my ($item) = @_;
       # Unpack the argument itself rather than relying on $_ from the caller's map.
       my ($prefix, $bodies) = @$item;
       my $deparsed = '';
       $deparsed .= "<*$prefix>" if defined($prefix);
       $deparsed .= deparse_bodies($bodies);
       return $deparsed;
    }

    sub deparse_items {
       my ($items) = @_;
       return join ',', map deparse_item($_), @$items;
    }

    # ----------^^^ Example of what you can do ^^^----------

    my $parser = Grammar->new();
    while (<DATA>) {
       chomp;
       my $items = $parser->parse($_)
          or do {
             warn("Bad data at line $.\n");
             next;
          };

       print("in: $_\n");
       #print Dumper $items;
       print("out: ", deparse_items($items), "\n");
       print("\n");
    }

    __DATA__
    <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
    $ perl make_parser.pl && perl test.pl
    in: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
    out: <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>

    You didn't specify what data you need from the line, so the parser returns everything. It could be simplified if your requirements are more specific.

      Wow, thanks for your in-depth reply! I can see how this can be very powerful, but I'm afraid I don't understand
      the Parse::RecDescent module well enough to expand on the grammar.

      I'll see if I can read up on the module :-)

      Thanks again!
        You still didn't specify what data you need from the line. If you really just want to split on the commas, you can use Text::Balanced.
        use strict;
        use warnings;

        use Text::Balanced qw( extract_bracketed extract_multiple );

        while (<DATA>) {
           chomp;
           print("\n") if $. != 1;

           my @extracted = extract_multiple($_, [
              ',',
              \&extract_bracketed,
           ]);

           my @pieces;
           push @pieces, '' if @extracted;
           for (@extracted) {
              if ($_ eq ',') {
                 push @pieces, '';
              } else {
                 $pieces[-1] .= $_;
              }
           }

           print("$_\n") for @pieces;
        }

        __DATA__
        <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
Re: split on commas
by ig (Vicar) on Jun 06, 2009 at 11:34 UTC

    Another option is:

    use strict;
    use warnings;
    use Data::Dumper;
    use Regexp::Common qw/balanced/;

    my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
    print "$string\n";
    my @parts = $string =~ m/(?:$RE{balanced}{-parens=>'()<>'}|[^,<(]+)+/g;
    print Dumper(\@parts);

    which produces

    <*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>
    $VAR1 = [
              '<*2>FOO<2,1>',
              '<*3>(SigB<8:0:2>,BAR)',
              '<*2>Siga<2:0>',
              'Sigb<8,7,6,5,0>'
            ];
Re: split on commas
by JavaFan (Canon) on Jun 07, 2009 at 00:36 UTC
    I wouldn't use split. It's much easier to define what you want to keep:
    use 5.010;    # enables say

    $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
    @parts = $string =~ /((?:[^<(,]+|<[^>]*>|\([^)]+\))+)/g;
    say for @parts;
    __END__
    <*2>FOO<2,1>
    <*3>(SigB<8:0:2>,BAR)
    <*2>Siga<2:0>
    Sigb<8,7,6,5,0>
    This assumes no nesting of parentheses or angle brackets.
Re: split on commas
by Anonymous Monk on Jun 07, 2009 at 15:39 UTC
    First, look for a start marker, '<' or '('. Leave every character after it as it is until you reach the respective end marker: '>' for '<' and ')' for '('. Replace the commas found at any other time with '|||' and split on that. Maybe this helps. Abhishek. abhi12524@yahoo.com
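
One way to read the scan described above is as a small state machine that remembers which end marker it is waiting for. The sketch below splits directly at the outside commas instead of going through '|||'; the helper name split_outside_markers is hypothetical, and the sketch assumes no marker of the same kind is nested inside itself.

```perl
use strict;
use warnings;

# Hypothetical helper: scan the string once, ignoring everything between a
# start marker and its respective end marker.
# Assumption: no same-kind nesting such as ((a,b),c); that input would be
# split in the wrong place.
sub split_outside_markers {
    my ($string) = @_;
    my %closer_for = ('<' => '>', '(' => ')');
    my @parts = ('');
    my $waiting_for;    # end marker we are currently inside, if any

    for my $char (split //, $string) {
        if (defined $waiting_for) {
            # Inside a marked region: only the matching end marker matters.
            undef $waiting_for if $char eq $waiting_for;
        }
        elsif (exists $closer_for{$char}) {
            $waiting_for = $closer_for{$char};
        }
        elsif ($char eq ',') {
            push @parts, '';    # comma outside any markers: start a new field
            next;
        }
        $parts[-1] .= $char;
    }
    return @parts;
}

my $string = "<*2>FOO<2,1>,<*3>(SigB<8:0:2>,BAR),<*2>Siga<2:0>,Sigb<8,7,6,5,0>";
print "$_\n" for split_outside_markers($string);
```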
Re: split on commas
by tphyahoo (Vicar) on Jun 08, 2009 at 21:11 UTC
    Not a beginner answer, but the "principled" way to solve this is to use a parser, not a regex.

    Parse::RecDescent is such a parser.

    However, as the other answers have said, there are ways to "cheat" around this without using a full-blown parser. The cheats may be brittle: for example, what happens if you have brackets inside a bracket, and commas inside that? However, they'll work for most purposes.

Re: split on commas
by John M. Dlugosz (Monsignor) on Jun 10, 2009 at 17:29 UTC
    I recall that Perl 5.10 added some regex features including full recursive sub-regexes and that should allow "parsing" to a greater degree than we had before. The example in Perldelta matches nested angle brackets. I recall a more detailed overview of new Perl 5.10 features somewhere that showed this and also external regexes included by reference.
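
As a rough illustration of the recursive sub-regexes mentioned above, the sketch below uses a named group that calls itself to skip over nested '<>' and '()' groups. It assumes Perl 5.10 or later, is not the perldelta example, and the helper name split_top_level is made up here. Unbalanced delimiters are not reported; they are silently skipped.

```perl
use strict;
use warnings;
use 5.010;    # named groups and recursion need Perl 5.10 or later

# Hypothetical helper: return the top-level comma-separated fields,
# treating arbitrarily nested <...> and (...) groups as opaque.
sub split_top_level {
    my ($string) = @_;

    my $field = qr/
        ( (?: [^,<>()]++ | (?&balanced) )+ )   # capture 1: one whole field
        (?(DEFINE)
            (?<balanced>
                <  (?: [^<>()]++ | (?&balanced) )* >
              | \( (?: [^<>()]++ | (?&balanced) )* \)
            )
        )
    /x;

    # Collect capture 1 in a loop; a list-context match would also return
    # the (undefined) named group for every field.
    my @parts;
    while ($string =~ /$field/g) {
        push @parts, $1;
    }
    return @parts;
}

print "$_\n" for split_top_level("(1,2,3,<4,5,6>),more");
```

Empty fields (two commas in a row) are dropped by this pattern, so it is closer to "extract the fields" than to a strict split.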

Node Type: perlquestion [id://768859]
Approved by AnomalousMonk
Front-paged by linuxer