Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Complex Splitting

by neversaint (Deacon)
on Feb 06, 2007 at 13:27 UTC ( [id://598530]=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
How can we split this string:
my $str = "ABC[GHI]XYZ"; # Note that in actual situation string may contain several # such bracketed alphabets.
into such array:
my @array = ("A", "B", "C", "GHI", "X", "Y", "Z");
meaning every characters in the bracket - [GHI] are treated as single entity in an array. I tried this:
perl -MData::Dumper -e ' $str ="ABC[GHI]XYZ"; @arr= split(/[\[\]]/,$str); print Dumper \@arr;'
but doesn't do the job.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Complex Splitting
by holli (Abbot) on Feb 06, 2007 at 13:40 UTC
    Quickshot:
    push @array, $2 ? $2 : $3 while $str =~ /(([A-Z])|\[([A-Z]+)\])/g;


    holli, /regexed monk/

      In perl 5.10, you don't need to do backflips to get away from the "extra" captures. (?^...|...) means the captures get renumbered so here both are $1. Only one ever actually matches at any time so there's no conflict.

      push @array, $1 while $str =~ /(?^(\w+)|\[(\w+)\])/g;

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Complex Splitting
by rinceWind (Monsignor) on Feb 06, 2007 at 13:40 UTC

    You're better of using a matching regex rather than a split:

    @arr = $str =~ /(\w|\[\w+\])/g;

    Update: thanks to monarch for pointing this out. The above leaves the brackets behind. One option is to split the capture and filter out the undefs thus:

    @arr = grep $_, $str =~ /(\w)|\[(\w+)\]/g;

    --

    Oh Lord, won’t you burn me a Knoppix CD ?
    My friends all rate Windows, I must disagree.
    Your powers of persuasion will set them all free,
    So oh Lord, won’t you burn me a Knoppix CD ?
    (Missquoting Janis Joplin)

       @arr = grep {defined} $str =~ /(\w)|\[(\w+)\]/g; gets exactly what you wanted without the brackets. Not sure how to avoid the undef's except grepping them out.


      ___________
      Eric Hodges
Re: Complex Splitting
by johngg (Canon) on Feb 06, 2007 at 14:19 UTC
    You can do this with a split but the pattern is a lot trickier than that for the global match. Advantage is you don't have to worry about grep'ing out undefined values.

    use strict; use warnings; use Data::Dumper; my $str = q{ABC[GHI]XYZ}; my $rxSplit = qr {(?x) # use extended syntax (?: # non-capturing group \[ # literal opening square br. | # or \] # literal closing square br. | # or (?= # look-ahead, a point # followed by [^]]+ # one or more non-closing # square brs. (?: # non-capturing group # then either \[ # literal opening square br. | # or \z # end of string ) # close non-capturing group ) # close look-ahead ) # close non-capturing group }; my @array = split m{$rxSplit}, $str; print Data::Dumper->Dump([\@array], [qw{*array}]);

    Here's the output.

    @array = ( 'A', 'B', 'C', 'GHI', 'X', 'Y', 'Z' );

    On balance, I'd go with the global match, something like holli's or eric256's solutions.

    Cheers,

    JohnGG

      I don't think it's a lot trickier. This oughta do it:
      @array = grep defined, split /(\[.*?\])|/;
      That is, if you split on a bracketed group, capture it. If you don't have a bracketed group, split on nothing. Then filter out where the capture didn't match.

      Caution: Contents may have been coded under pressure.
        Yes, I regretted the word "lot" as soon as I'd posted, made it sound as if I'd sweated blood writing it. Although I like the simpler approach of your solution there is one problem with it; it doesn't get rid of the square brackets, which was required by the OP.

        Cheers,

        JohnGG

Re: Complex Splitting - Parse::RecDescent
by imp (Priest) on Feb 06, 2007 at 15:02 UTC
    If the string being split will always be well-formed then I would go with one of the regex solutions provided above. If there is a possiblity that the data will be malformed then you may be better off with a parser approach as it would allow for more flexibility in error handling.

    Here is a solution using Parse::RecDescent.

    use Parse::RecDescent; use strict; use warnings; my $str = "ABC[GHI]XY[Z]1A"; my $grammar = <<'GRAMMAR'; token : '[' /[A-Z]*/ ']' {$return = $item[2]} | /[A-Z]/ anything : /./ GRAMMAR my $parser = Parse::RecDescent->new($grammar); # When a reference to a scalar is passed to Parse::RecDescent it will # consume the tokens as they are matched. To avoid modifying the origi +nal # string a copy will be used my $copy = $str; while ($copy ne '') { if (my $token = $parser->token(\$copy)) { print "Token: $token\n"; } else { my $token = $parser->anything(\$copy); print "Invalid symbol: $token\n"; } }

      That silently ignores whitespace (read up on <skip>).

      Also, P::RD is rather slow. I'd even say inexcusably slow if you're just using it as a tokenizer. May I suggest a much faster tokenizer?

      use strict; use warnings; sub process_token { my ($token) = @_; print("Token: $token\n"); } { my $str = "ABC[GHI]XY[Z]1A"; for ($str) { /\G \[ ([A-Z]*) \] /xgcs && do { process_token("$1"); redo }; /\G ([A-Z]) /xgcs && do { process_token("$1"); redo }; /\G (.) /xgcs && do { printf("Unexpected '%s' at pos %d\n", $1, pos()-length($1)); redo }; } }
Re: Complex Splitting
by tcf03 (Deacon) on Feb 06, 2007 at 14:03 UTC
    This should do the trick - Its only tested on your string though...
    #!/usr/bin/perl use strict; use warnings; my $string = "ABC[GHI]XYZ"; my @array; splitstring($string); push @array, split//, $string; print "$_ " for sort @array; sub splitstring { return unless ( $string =~ /.*\[\w+\].*/ ); if ( $string =~ /.*(\[(\w+)\]).*/ ) { my $remove = $1; my $str = $2; print "rem = $remove\n str = $str\n"; push @array, $str; $string =~ s/\Q$remove\E//; } splitstring($string); }
    UPDATE
    I substututed the string with [987]ABC[GHI]XYZ[dfg] and that works as well.

    regards

    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
Re: Complex Splitting
by shmem (Chancellor) on Feb 06, 2007 at 14:57 UTC
    perl -le '$_="[987]ABC[GHI]XYZ[dfg]"; print "$_" for split/(?!\w+\])|\ +[|]/,$_'

    but that spits out extra fields. Let's add a grep:

    perl -le '$_="[987]ABC[GHI]XYZ[dfg]";print "$_" for grep{$_}split/(?!\ +w+\])|\[|]/,$_'

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Complex Splitting
by fenLisesi (Priest) on Feb 06, 2007 at 17:53 UTC
    Here is another attempt at expanding a one-liner into a big script, this one using a canned /regexp?/i:
    use strict; use warnings; use Regexp::Common qw(balanced); my @streams = (qw( [ABC][DEF][HIJ] ABC[DEF]HI[JK][LMNO]PQ [A][B]C[DEF]HI[JK][LMNO]PQ [A][B]C[DEF]HI[JK][LMNO]P[Q] [ABCDEFHIJKLMNOPQ] ABCDEF ), q()); my $PATTERN = $RE{balanced}{-parens=>'[]'}; for my $input (@streams) { print qq($input => ); my @pieces = (); while ($input =~ s/(\w*)($PATTERN)//) { my ($prematch, $match) = ($1, $2); push @pieces, split //, $prematch; push @pieces, $1 if $match =~ /(\w+)/; } push @pieces, split //, $input if length $input; printf qq(%s\n), join qq( ), @pieces; } __END__
    which prints:
    [ABC][DEF][HIJ] => ABC DEF HIJ ABC[DEF]HI[JK][LMNO]PQ => A B C DEF H I JK LMNO P Q [A][B]C[DEF]HI[JK][LMNO]PQ => A B C DEF H I JK LMNO P Q [A][B]C[DEF]HI[JK][LMNO]P[Q] => A B C DEF H I JK LMNO P Q [ABCDEFHIJKLMNOPQ] => ABCDEFHIJKLMNOPQ ABCDEF => A B C D E F =>
    Cheers.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://598530]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others wandering the Monastery: (5)
As of 2025-06-23 22:31 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?
    erzuuliAnonymous Monks are no longer allowed to use Super Search, due to an excessive use of this resource by robots.