Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Complex Splitting

by neversaint (Deacon)
on Feb 06, 2007 at 13:27 UTC ( #598530=perlquestion: print w/replies, xml ) Need Help??

neversaint has asked for the wisdom of the Perl Monks concerning the following question:

Dear Masters,
How can we split this string:
my $str = "ABC[GHI]XYZ"; # Note that in actual situation string may contain several # such bracketed alphabets.
into such array:
my @array = ("A", "B", "C", "GHI", "X", "Y", "Z");
meaning every characters in the bracket - [GHI] are treated as single entity in an array. I tried this:
perl -MData::Dumper -e ' $str ="ABC[GHI]XYZ"; @arr= split(/[\[\]]/,$str); print Dumper \@arr;'
but doesn't do the job.

---
neversaint and everlastingly indebted.......

Replies are listed 'Best First'.
Re: Complex Splitting
by holli (Abbot) on Feb 06, 2007 at 13:40 UTC
    Quickshot:
    push @array, $2 ? $2 : $3 while $str =~ /(([A-Z])|\[([A-Z]+)\])/g;


    holli, /regexed monk/

      In perl 5.10, you don't need to do backflips to get away from the "extra" captures. (?^...|...) means the captures get renumbered so here both are $1. Only one ever actually matches at any time so there's no conflict.

      push @array, $1 while $str =~ /(?^(\w+)|\[(\w+)\])/g;

      ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Re: Complex Splitting
by rinceWind (Monsignor) on Feb 06, 2007 at 13:40 UTC

    You're better of using a matching regex rather than a split:

    @arr = $str =~ /(\w|\[\w+\])/g;

    Update: thanks to monarch for pointing this out. The above leaves the brackets behind. One option is to split the capture and filter out the undefs thus:

    @arr = grep $_, $str =~ /(\w)|\[(\w+)\]/g;

    --

    Oh Lord, won’t you burn me a Knoppix CD ?
    My friends all rate Windows, I must disagree.
    Your powers of persuasion will set them all free,
    So oh Lord, won’t you burn me a Knoppix CD ?
    (Missquoting Janis Joplin)

       @arr = grep {defined} $str =~ /(\w)|\[(\w+)\]/g; gets exactly what you wanted without the brackets. Not sure how to avoid the undef's except grepping them out.


      ___________
      Eric Hodges
Re: Complex Splitting
by johngg (Canon) on Feb 06, 2007 at 14:19 UTC
    You can do this with a split but the pattern is a lot trickier than that for the global match. Advantage is you don't have to worry about grep'ing out undefined values.

    use strict; use warnings; use Data::Dumper; my $str = q{ABC[GHI]XYZ}; my $rxSplit = qr {(?x) # use extended syntax (?: # non-capturing group \[ # literal opening square br. | # or \] # literal closing square br. | # or (?= # look-ahead, a point # followed by [^]]+ # one or more non-closing # square brs. (?: # non-capturing group # then either \[ # literal opening square br. | # or \z # end of string ) # close non-capturing group ) # close look-ahead ) # close non-capturing group }; my @array = split m{$rxSplit}, $str; print Data::Dumper->Dump([\@array], [qw{*array}]);

    Here's the output.

    @array = ( 'A', 'B', 'C', 'GHI', 'X', 'Y', 'Z' );

    On balance, I'd go with the global match, something like holli's or eric256's solutions.

    Cheers,

    JohnGG

      I don't think it's a lot trickier. This oughta do it:
      @array = grep defined, split /(\[.*?\])|/;
      That is, if you split on a bracketed group, capture it. If you don't have a bracketed group, split on nothing. Then filter out where the capture didn't match.

      Caution: Contents may have been coded under pressure.
        Yes, I regretted the word "lot" as soon as I'd posted, made it sound as if I'd sweated blood writing it. Although I like the simpler approach of your solution there is one problem with it; it doesn't get rid of the square brackets, which was required by the OP.

        Cheers,

        JohnGG

Re: Complex Splitting - Parse::RecDescent
by imp (Priest) on Feb 06, 2007 at 15:02 UTC
    If the string being split will always be well-formed then I would go with one of the regex solutions provided above. If there is a possiblity that the data will be malformed then you may be better off with a parser approach as it would allow for more flexibility in error handling.

    Here is a solution using Parse::RecDescent.

    use Parse::RecDescent; use strict; use warnings; my $str = "ABC[GHI]XY[Z]1A"; my $grammar = <<'GRAMMAR'; token : '[' /[A-Z]*/ ']' {$return = $item[2]} | /[A-Z]/ anything : /./ GRAMMAR my $parser = Parse::RecDescent->new($grammar); # When a reference to a scalar is passed to Parse::RecDescent it will # consume the tokens as they are matched. To avoid modifying the origi +nal # string a copy will be used my $copy = $str; while ($copy ne '') { if (my $token = $parser->token(\$copy)) { print "Token: $token\n"; } else { my $token = $parser->anything(\$copy); print "Invalid symbol: $token\n"; } }

      That silently ignores whitespace (read up on <skip>).

      Also, P::RD is rather slow. I'd even say inexcusably slow if you're just using it as a tokenizer. May I suggest a much faster tokenizer?

      use strict; use warnings; sub process_token { my ($token) = @_; print("Token: $token\n"); } { my $str = "ABC[GHI]XY[Z]1A"; for ($str) { /\G \[ ([A-Z]*) \] /xgcs && do { process_token("$1"); redo }; /\G ([A-Z]) /xgcs && do { process_token("$1"); redo }; /\G (.) /xgcs && do { printf("Unexpected '%s' at pos %d\n", $1, pos()-length($1)); redo }; } }
Re: Complex Splitting
by tcf03 (Deacon) on Feb 06, 2007 at 14:03 UTC
    This should do the trick - Its only tested on your string though...
    #!/usr/bin/perl use strict; use warnings; my $string = "ABC[GHI]XYZ"; my @array; splitstring($string); push @array, split//, $string; print "$_ " for sort @array; sub splitstring { return unless ( $string =~ /.*\[\w+\].*/ ); if ( $string =~ /.*(\[(\w+)\]).*/ ) { my $remove = $1; my $str = $2; print "rem = $remove\n str = $str\n"; push @array, $str; $string =~ s/\Q$remove\E//; } splitstring($string); }
    UPDATE
    I substututed the string with [987]ABC[GHI]XYZ[dfg] and that works as well.

    regards

    Ted
    --
    "That which we persist in doing becomes easier, not that the task itself has become easier, but that our ability to perform it has improved."
      --Ralph Waldo Emerson
Re: Complex Splitting
by shmem (Chancellor) on Feb 06, 2007 at 14:57 UTC
    perl -le '$_="[987]ABC[GHI]XYZ[dfg]"; print "$_" for split/(?!\w+\])|\ +[|]/,$_'

    but that spits out extra fields. Let's add a grep:

    perl -le '$_="[987]ABC[GHI]XYZ[dfg]";print "$_" for grep{$_}split/(?!\ +w+\])|\[|]/,$_'

    --shmem

    _($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                                  /\_¯/(q    /
    ----------------------------  \__(m.====·.(_("always off the crowd"))."·
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Complex Splitting
by fenLisesi (Priest) on Feb 06, 2007 at 17:53 UTC
    Here is another attempt at expanding a one-liner into a big script, this one using a canned /regexp?/i:
    use strict; use warnings; use Regexp::Common qw(balanced); my @streams = (qw( [ABC][DEF][HIJ] ABC[DEF]HI[JK][LMNO]PQ [A][B]C[DEF]HI[JK][LMNO]PQ [A][B]C[DEF]HI[JK][LMNO]P[Q] [ABCDEFHIJKLMNOPQ] ABCDEF ), q()); my $PATTERN = $RE{balanced}{-parens=>'[]'}; for my $input (@streams) { print qq($input => ); my @pieces = (); while ($input =~ s/(\w*)($PATTERN)//) { my ($prematch, $match) = ($1, $2); push @pieces, split //, $prematch; push @pieces, $1 if $match =~ /(\w+)/; } push @pieces, split //, $input if length $input; printf qq(%s\n), join qq( ), @pieces; } __END__
    which prints:
    [ABC][DEF][HIJ] => ABC DEF HIJ ABC[DEF]HI[JK][LMNO]PQ => A B C DEF H I JK LMNO P Q [A][B]C[DEF]HI[JK][LMNO]PQ => A B C DEF H I JK LMNO P Q [A][B]C[DEF]HI[JK][LMNO]P[Q] => A B C DEF H I JK LMNO P Q [ABCDEFHIJKLMNOPQ] => ABCDEFHIJKLMNOPQ ABCDEF => A B C D E F =>
    Cheers.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://598530]
Approved by Corion
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others exploiting the Monastery: (3)
As of 2022-06-26 08:22 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?
    My most frequent journeys are powered by:









    Results (84 votes). Check out past polls.

    Notices?