PerlMonks  

Splitting string using two overlapping patterns

by kpr (Initiate)
on Oct 24, 2013 at 19:46 UTC (#1059537=perlquestion)
kpr has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to split header names for columns in a data file. The issue I encountered was awkward formatting. A simple example of such a header line is: 'Iteration {Applied Field} {Total Energy} Mx'

Each element is separated by \s+. However, if an element contains whitespace, it is surrounded by curly brackets. The two patterns I try to match are '\s+{([\s\w:]+)}\s+' and '\s+([\w:]+)\s+'. The latter sometimes overlaps the first (e.g. inside '{blaa blaa blaa}').

I'm looking for guidance on how to improve my current solution:

    use warnings;
    use strict;

    my $str = 'Iteration {Applied Field} {Total Energy} Mx';

    while (length($str) > 0) {
        #print "str:$str:\n";
        if ($str =~ m/^({([\s\w:]+)}(\s+)?)/) {
            print "1: $2\n";
            $str =~ s/^$1//;
        }
        elsif ($str =~ m/^(([\w:]+)(\s+)?)/) {
            print "2: $1\n";
            $str =~ s/^$1//;
        }
        else { die "error"; }
    }

Result:

    2: Iteration
    1: Applied Field
    1: Total Energy
    2: Mx

I.e., try pattern 1; if it fails, try pattern 2. Then remove the matching part from the beginning. Is it possible to combine the two patterns into a normal split call or an m/.../g pattern and capture the text from the middle?

Re: Splitting string using two overlapping patterns
by Lennotoecom (Pilgrim) on Oct 24, 2013 at 20:10 UTC
    $line = 'Iteration {Applied Field} A {Total Energy} Mx F G {Third test line}';
    print "$&\n" while $line=~s/(?<={)[\w\s]+(?=})|\w+//;

    Iteration
    Applied Field
    A
    Total Energy
    Mx
    F
    G
    Third test line
      print "$1\n" while $line=~s/(\w+\s+\w+|\w+)[}|\s+|\s+{]//;

      Metacharacters like | and + are not special in a character class, so [}|\s+|\s+{] from the quoted regex is equivalent, with the pipe explicitly escaped (for clarity), to [\s{}\|+] or, less verbosely, to the [\s{}|+] class. (In other words, there's no alternation or quantification in a character class.)
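The point about character classes can be checked directly. The following is a minimal sketch using only core Perl; the variable names and the probe characters are made up for illustration. It compares the class as written in the quoted regex against a simplified class that lists each member once, and note that the literal + is a member as well as the pipe.

```perl
use strict;
use warnings;

# The class from the quoted regex. Inside [...] the '|' and '+' are
# ordinary members; only escapes such as \s keep their meaning.
my $class_as_written = qr/[}|\s+|\s+{]/;

# The same set of characters, each listed once.
my $class_simplified = qr/[\s{}|+]/;

# Compare both classes on a handful of single characters.
my @probes = ('}', '{', '|', '+', ' ', "\t", 'a', '_');
my @disagreements = grep {
    ($_ =~ /\A$class_as_written\z/ ? 1 : 0)
        != ($_ =~ /\A$class_simplified\z/ ? 1 : 0)
} @probes;

print "classes agree on all probes\n" unless @disagreements;
```

If alternation between multi-character pieces is what is wanted, a group such as (?:}|\s+|\s+{) is the construct to reach for instead of brackets.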

        Noted, thank you. I corrected my previous version because it contained a serious mistake.
Re: Splitting string using two overlapping patterns
by BrowserUk (Pope) on Oct 24, 2013 at 20:19 UTC

    Assuming you want to discard the {}s:

    $str = 'Iteration {Applied Field} {Total Energy} Mx Fx {a} B C D {E F} G';;
    print for split '(?:}\s+{|\s+{|}\s+|\s+(?!\S+}))', $str;;

    Iteration
    Applied Field
    Total Energy
    Mx
    Fx
    a
    B
    C
    D
    E F
    G

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Splitting string using two overlapping patterns
by AnomalousMonk (Abbot) on Oct 24, 2013 at 21:11 UTC

    Sometimes it's better to extract what you want rather than trying to split out what you don't want. (\K is available in Perl 5.10+.)

    >perl -wMstrict -le
    "my $str = 'Iteration {Applied Field} {Total Energy} {Foo} { a b c d } Mx';
     ;;
     my $rx_word  = qr{ [[:alnum:]]+ }xms;
     my $rx_curly = qr{ { \s* \K $rx_word (?: \s+ $rx_word)* (?= \s* }) }xms;
     ;;
     my @fields = $str =~ m{ $rx_curly | $rx_word }xmsg;
     printf qq{'$_' } for @fields;
    "
    'Iteration' 'Applied Field' 'Total Energy' 'Foo' 'a b c d' 'Mx'
Re: Splitting string using two overlapping patterns
by sundialsvc4 (Monsignor) on Oct 24, 2013 at 23:59 UTC

    When I look at situations like this, I get really nervous that the incoming file might be even more inconsistent than I thought it was, and that my “clever regex” solution might be less robust than I need it to be. I would not be confident that my code is, in fact, a verifiably correct answer, because of the “clever regex.” So what I would probably choose to do is use a loop and break the string down left to right, pushing the pieces onto a stack-array as I parse them. For example:

    while ($str ne '') {
        $str =~ s/^\s+//;       # REMOVE LEADING WHITESPACE
        last if ($str eq '');   # LEAVE LOOP EARLY IF IT WAS ALL-WHITESPACE
        if ($str =~ /^\{/) {    # STARTS WITH '{'
            ...
        elsif ...
            ...
    Well, you get the idea, I think.

    Even though this code might or might not be “efficient,” I am fairly confident that I could debug it, and that I could extend it to cover new cases and be confident that (a) the new changes work and (b) I didn’t break something in the process.
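One way the loop idea might be filled in is sketched below. This is not the poster's elided code: the sub name parse_header, the exact patterns for braced and bare names, and the decision to die on a stray brace are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: consume a header line from the left, one name at a time.
sub parse_header {
    my ($str) = @_;
    my @names;
    while ($str ne '') {
        $str =~ s/^\s+//;                   # remove leading whitespace
        last if $str eq '';                 # only whitespace was left
        if ($str =~ s/^\{([^{}]*)\}//) {    # braced name: everything up to '}'
            push @names, $1;
        }
        elsif ($str =~ s/^([^\s{}]+)//) {   # bare name: no whitespace, no braces
            push @names, $1;
        }
        else {                              # stray brace or other surprise
            die "Cannot parse header at: '$str'\n";
        }
    }
    return @names;
}

print "$_\n" for parse_header('Iteration {Applied Field} {Total Energy} Mx');
```

Each branch removes what it recognised, so the loop always makes progress, and unexpected input stops the program with the unparsed remainder in the message rather than being skipped silently.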

      It looks like he's parsing the headers from the output of a specific program, so it ought to be consistent. Your method and the other suggestions should work, but the way he's looking to do it should be fine. This isn't large-scale logfile parsing...

      Bioinformatics
Re: Splitting string using two overlapping patterns
by kcott (Abbot) on Oct 25, 2013 at 05:50 UTC

    G'day kpr,

    Welcome to the monastery.

    "Is it possible to combine two patterns into normal split function or m/.../g pattern and capture the text from the middle?"

    Yes, this is possible. There are also many ways to do it. I see a number of solutions have already been posted. Here's another one.

    I've only included code that's been available since v5.8 or earlier. I've shown how to use the 'x' modifier to make your regex easier to read, easier to maintain and, generally, easier to deal with. Uncomment any of the "my $re = ..." lines to see that they all work the same (obviously, only uncomment one at a time). I've used test data that you posted as well as that provided by some of the monks who've already posted solutions.

    Here's the code. Either "perlre - Perl regular expressions" or "perlvar: Variables related to regular expressions" should supply answers to any questions you have; if not, feel free to ask.

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my @lines = (
        'Iteration {Applied Field} {Total Energy} Mx',
        '{blaa blaa blaa}',
        'Iteration {Applied Field} {Total Energy} Mx',
        'Iteration {Applied Field} A {Total Energy} Mx F G {Third test line}',
        'Iteration {Applied Field} {Total Energy} Mx Fx {a} B C D {E F} G',
        'Iteration {Applied Field} {Total Energy} {Foo} { a b c d } Mx',
    );

    my $re_hard_to_read = qr/(?>[{]([^}]+)|\s*(?![{}])(\S+))/;
    my $re_easy_to_read = qr/(?> [{] ( [^}]+ ) | \s* (?! [{}] ) ( \S+ ) )/x;
    my $re_fully_annotated = qr/
        (?>             # start non-capturing, non-backtracking, alternation
            [{]         # MATCH: exactly one literal left brace
            (           # start capture
                [^}]+   # CAPTURE: one or more of any character except right brace
            )           # end capture
        |               # - OR -
            \s*         # MATCH: zero or more whitespace
            (?!         # start zero-width negative lookahead assertion
                [{}]    # ASSERT: next character is not left or right brace
            )           # end zero-width negative lookahead assertion
            (           # start capture
                \S+     # CAPTURE: one or more non-whitespace characters
            )           # end capture
        )               # end non-capturing, non-backtracking, alternation
    /x;

    #my $re = $re_hard_to_read;
    #my $re = $re_easy_to_read;
    my $re = $re_fully_annotated;

    for (@lines) {
        # Print output heading
        print join "\n$_\n" => ('-' x 60) x 2;

        # Array to hold header names
        my @header_names;

        # Capture header names
        push @header_names => $+ while /$re/g;

        # Output header names
        print join "\n" => @header_names;
    }

    The output starts like this:

    ------------------------------------------------------------
    Iteration {Applied Field} {Total Energy} Mx
    ------------------------------------------------------------
    Iteration
    Applied Field
    Total Energy
    Mx
    ------------------------------------------------------------
    {blaa blaa blaa}
    ------------------------------------------------------------
    blaa blaa blaa

    Here's the rest:

    -- Ken

Re: Splitting string using two overlapping patterns
by oiskuu (Pilgrim) on Oct 25, 2013 at 11:52 UTC
    And the 6th way to skin the cat:
    while ($str =~ m/{(.+?)}|([\w:]+)/g) { print $+; }
    .+? is a non-greedy match of any characters (other than newline). $+ holds the text of the highest-numbered capture group that took part in the match.
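To collect the names instead of printing them, the same loop can push $+ onto an array. The sketch below keeps the alternation from the post but escapes the braces; the array name @names and the sample string (taken from the original question) are choices made for illustration.

```perl
use strict;
use warnings;

my $str = 'Iteration {Applied Field} {Total Energy} Mx';

my @names;
# Only one of the two groups takes part in any single match,
# so $+ is whichever of $1 or $2 is defined for that match.
while ($str =~ m/\{(.+?)\}|([\w:]+)/g) {
    push @names, $+;
}

print join(', ', @names), "\n";
```

For the sample string this prints "Iteration, Applied Field, Total Energy, Mx".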
Re: Splitting string using two overlapping patterns
by hdb (Prior) on Oct 25, 2013 at 12:00 UTC

    And I think you should a) replace consecutive whitespace with a single blank, b) replace braces with quotes, and then c) use Text::CSV to split into fields.
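The three steps could look roughly like the sketch below. Note the substitution: Text::CSV is a CPAN module, so to keep the example runnable on a stock Perl, step c) here uses quotewords from the core module Text::ParseWords, which is not what the post names. The sub name split_header is hypothetical, and quotewords also treats single quotes as quoting characters, which may matter for headers containing apostrophes.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Hypothetical helper following steps a) to c), with Text::ParseWords
# standing in for Text::CSV in step c).
sub split_header {
    my ($line) = @_;
    $line =~ s/\s+/ /g;     # a) collapse each whitespace run to one blank
    $line =~ s/^ | $//g;    #    trim, so no empty first or last field appears
    $line =~ tr/{}/""/;     # b) turn both braces into double quotes
    return quotewords(' ', 0, $line);   # c) split on blanks, honouring quotes
}

print "$_\n" for split_header("Iteration  {Applied Field}\t{Total Energy} Mx");
```

Step a) also collapses whitespace inside the braces, so a name such as "{Applied   Field}" comes out with a single blank.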

Front-paged by Corion