http://www.perlmonks.org?node_id=1059584


in reply to Splitting string using two overlapping patterns

G'day kpr,

Welcome to the monastery.

"Is it possible to combine two patterns into normal split function or m/.../g pattern and capture the text from the middle?"

Yes, this is possible. There's also many ways to do it. I see a number of solutions have already been posted. Here's another one.

I've only included code that's been available since v5.8 or earlier. I've shown how to use the 'x' modifier to make your regex easier to read, easier to maintain and, generally, easier to deal with. Uncomment any of the "my $re = ..." lines to see that they all work the same (obviously, only uncomment one at a time). I've used test data that you posted as well as that provided by some of the monks who've already posted solutions.

Here's the code. Either "perlre - Perl regular expressions" or "perlvar: Variables related to regular expressions" should supply answers to any questions you have; if not, feel free to ask.

#!/usr/bin/env perl -l use strict; use warnings; my @lines = ( 'Iteration {Applied Field} {Total Energy} Mx', '{blaa blaa blaa}', 'Iteration {Applied Field} {Total Energy} Mx', 'Iteration {Applied Field} A {Total Energy} Mx F G {Third tes +t line}', 'Iteration {Applied Field} {Total Energy} Mx Fx {a} B C D {E F +} G', 'Iteration {Applied Field} {Total Energy} {Foo} { a b c d } + Mx', ); my $re_hard_to_read = qr/(?>[{]([^}]+)|\s*(?![{}])(\S+))/; my $re_easy_to_read = qr/(?> [{] ( [^}]+ ) | \s* (?! [{}] ) ( \S+ ) )/x; my $re_fully_annotated = qr/ (?> # start non-capturing, non-backtracking, alternati +on [{] # MATCH: exactly one literal left brace ( # start capture [^}]+ # CAPTURE: one or more of any character except rig +ht brace ) # end capture | # - OR - \s* # MATCH: zero or more whitespace (?! # start zero-width negative lookahead assertion [{}] # ASSERT: next character is not left or right brac +e ) # end zero-width negative lookahead assertion ( # start capture \S+ # CAPTURE: one or more non-whitespace characters ) # end capture ) # end non-capturing, non-backtracking, alternation /x; #my $re = $re_hard_to_read; #my $re = $re_easy_to_read; my $re = $re_fully_annotated; for (@lines) { # Print output heading print join "\n$_\n" => ('-' x 60) x 2; # Array to hold header names my @header_names; # Capture header names push @header_names => $+ while /$re/g; # Output header names print join "\n" => @header_names; }

The output starts like this:

------------------------------------------------------------ Iteration {Applied Field} {Total Energy} Mx ------------------------------------------------------------ Iteration Applied Field Total Energy Mx ------------------------------------------------------------ {blaa blaa blaa} ------------------------------------------------------------ blaa blaa blaa

Here's the rest:

-- Ken