Regex for outside of brackets

by theravadamonk (Scribe)
on Jul 13, 2018 at 10:49 UTC ( [id://1218434]=perlquestion )

theravadamonk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

Is there a way to capture the text outside of brackets? I am looking for a regex.

This is my string:

THIS IS OUTSIDE (THIS IS INSIDE)

What I expect is:

THIS IS OUTSIDE

The regex below captures what is inside:

    \((.*?)\)

How can I capture everything except what is inside?

The character class below matches any single character except "(" and ")":

    [^()]

Your INPUTS?

Replies are listed 'Best First'.
Re: Regex for outside of brackets
by haukex (Archbishop) on Jul 13, 2018 at 11:27 UTC

    As always, the more examples, the better - see How to ask better questions using Test::More and sample data. For example, can there be other parentheses anywhere in the string? Can there be text after the parentheses that you want to capture? etc.

    Here's one way that works even when there are multiple sets of parentheses and strings outside of them.

    use warnings;
    use strict;

    my $str1 = "THIS IS OUTSIDE (THIS IS INSIDE)";
    (my $x = $str1) =~ s/\(.*?\)//g;
    # $x is "THIS IS OUTSIDE "

    my $str2 = "OUTSIDE (INSIDE) OUTSIDE";
    (my $y = $str2) =~ s/\(.*?\)//g;
    # $y is "OUTSIDE  OUTSIDE"

    my $str3 = "OUT(IN)OUT(IN)OUT(IN)(IN)OUT";
    (my $z = $str3) =~ s/\(.*?\)//g;
    # $z is "OUTOUTOUTOUT"

    The above does not work for nested parens, though. For that, you could use Regexp::Common::balanced:

    use Regexp::Common qw/balanced/;

    my $str4 = "a(b(c(d)e)f)g(h(i(j)k)l)m";
    (my $t = $str4) =~ s/$RE{balanced}{-parens=>'()'}//g;
    # $t is "agm"

      As always, the more examples, the better

      Completely agree. However, a well-defined specification would be better still.

Re: Regex for outside of brackets
by hippo (Bishop) on Jul 13, 2018 at 11:05 UTC
    use strict;
    use warnings;
    use Test::More;

    my @rec = (
        { have => 'THIS IS OUTSIDE (THIS IS INSIDE)', want => 'THIS IS OUTSIDE ' },
    );

    plan tests => scalar @rec;

    for my $this (@rec) {
        my ($have) = $this->{have} =~ /^([^(]*)/;
        is ($have, $this->{want})
    }

      Thanks a lot

      here's my easy code

      #!/usr/bin/perl
      use strict;
      use warnings;

      my $string = "THIS IS OUTSIDE (THIS IS INSIDE)";

      print "\n";
      print "string: $string";
      print "\n";

      #(my $new_string = $string) =~ s/^([^(]*)/THIS_IS_OUTSIDE_CAPTURED/g;
      # Note: the substitution below deletes the leading outside text,
      # so $new_string ends up as "(THIS IS INSIDE)".
      (my $new_string = $string) =~ s/^([^(]*)//g;

      print "\n";
      print "new_string: $new_string";
      print "\n\n";

Re: Regex for outside of brackets
by tybalt89 (Monsignor) on Jul 13, 2018 at 13:40 UTC
    #!/usr/bin/perl

    # https://perlmonks.org/?node_id=1218434

    use strict;
    use warnings;

    my $string = 'THIS IS OUTSIDE (THIS IS INSIDE)';

    my $inside = 0; # true if inside parens
    my $onlyoutside = $string =~ s/ (\() | (\)) | ([^()]+) /
      $1 ? $inside++ x 0 :
      $2 ? $inside-- x 0 :
      $3 x !$inside /gexr;

    print "before <$string>\n after <$onlyoutside>\n";

Re: Regex for outside of brackets
by atcroft (Abbot) on Jul 13, 2018 at 14:37 UTC

    I think this may be close to what you are asking, and only uses a module shipped in core, Text::Balanced:

    $ perl -MData::Dumper -MText::Balanced -le '
      use 5.010;
      $Data::Dumper::Deepcopy = $Data::Dumper::Sortkeys = 1;
      my $c = q{This is outside (This is inside).};
      my %found;
      (
        $found{prefix},
        $found{bracketed},
        $found{postfix},
      ) = Text::Balanced::extract_multiple($c,
        [ \&Text::Balanced::extract_bracketed, ],
      );
      say Data::Dumper->Dump( [ \%found, ], [ qw( *f ) ] );
    '
    %f = (
           'bracketed' => '(This is inside)',
           'postfix' => '.',
           'prefix' => 'This is outside '
         );

    Hope that helps.

Re: Regex for outside of brackets
by mr_mischief (Monsignor) on Jul 13, 2018 at 22:03 UTC

    Are you sure you want a match, and that you can only use a single regex? I have bad news... a regular language has no context. However, a regular expression and another tool or handful of tools can easily get you there. Take, for example, the substitution operator, a counter with a loop and some more regexes, or a regex match and a split on the match... Of course, feel free to use Text::Balanced as atcroft suggests or use some other toolset built for the level of the problem you're trying to solve. Regexes will only solve a subproblem of your problem.

    Here's the data file for the following examples.

    THIS IS OUTSIDE (THIS IS INSIDE)
    (inside) outside
    before (within) after
    before (within) between (within again) after
    b ((nested)) a
    before (within (nested)) after
    This one hangs (with an unmatched open
    This one has () an empty pair
    This opens (with one)) and double closes
    this is the last (really) one

    Now here's the first example, using the substitution operator:

    #!perl
    use strict;
    use warnings;
    use 5.12.0;

    my $cleanup = 1;

    while ( <> ) {
        chomp;
        s/\(+.*?\)+//g;
        y/ / /s, s/^\s|\s$// if $cleanup;
        say;
    }

    The above code produces the following output by replacing each pair of parentheses, along with anything between them, with the empty string. As written, it eliminates matched pairs and their contents, but it will also eliminate an extra closing parenthesis, and it will leave an opened but unclosed parenthetical in the output:

    THIS IS OUTSIDE
    outside
    before after
    before between after
    b a
    before after
    This one hangs (with an unmatched open
    This one has an empty pair
    This opens and double closes
    this is the last one

    Or, if you prefer to preserve whitespace as it was, set $cleanup to 0.

    This next example produces mostly the same output as the $cleanup = 0 version of the above. It does so by splitting the string into an array of characters and tracking the nesting level of the parentheses as it walks through them. A character is appended to the output string only if the nesting level is 0 (outside of any pair of parentheses). This one stops producing output at a hanging opened and unclosed pair. As written, it also excludes text at negative nesting levels (text trailing an extra close unmatched by an open).

    #!perl
    use strict;
    use warnings;
    use 5.12.0;

    while ( <> ) {
        chomp;
        my $str = '';
        my @parts = split //;
        my $inside = 0;
        for ( @parts ) {
            /\(/ && $inside++ && next;
            /\)/ && $inside-- && next;
            $str .= $_ unless $inside;
        }
        say $str;
    }

    THIS IS OUTSIDE
     outside
    before  after
    before  between  after
    b  a
    before  after
    This one hangs
    This one has  an empty pair
    This opens
    this is the last  one

    Or if you want to feed the match from one regex match into a split on that match...

    #!perl
    use strict;
    use warnings;
    use 5.12.0;

    while ( my $str = <> ) {
        chomp $str;
        my $extract = join '|', map { "\Q$_\E" } ( $str =~ m/(\(+.*?\)+)/g );
        say join '', split /$extract/, $str;
    }

    The above works because we know what we want to eliminate, which is a good use for split. In this particular case, we don't have a fixed regex against which to split, but we know how to match what we don't want. This solution captures that unwanted part, quotes it with \Q and \E, joins any multiples with the regex alternation (pipe, or '|'), then uses split and join to leave what's left of the string as a single string. As written, this will only eliminate matched pairs and their contents. This is basically emulating the substitution operator. One's intuition may be that since it's a more detailed treatment it'll be faster. However, we're doing more in Perl here, and the substitution operator is highly optimized. I don't know without benchmarking by how much, but I'm betting the example with the s/// is faster.
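The speed guess above can be checked with the core Benchmark module. What follows is only a sketch under stated assumptions: the sample line and the sub names with_subst and with_split are made up for illustration, and the timings will vary by machine and Perl version, so no result is claimed here.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; any line from the data file would do.
my $line = 'before (within) between (within again) after';

# Approach 1: delete the parenthesised groups with s///.
sub with_subst {
    (my $copy = $line) =~ s/\(+.*?\)+//g;
    return $copy;
}

# Approach 2: capture the groups, quote them, and split on them.
sub with_split {
    my $extract = join '|', map { quotemeta($_) } ( $line =~ m/(\(+.*?\)+)/g );
    return join '', split /$extract/, $line;
}

# Both approaches should agree before their speed is compared.
die "results differ\n" unless with_subst() eq with_split();

# Run each sub for at least one CPU second and print a comparison table.
cmpthese( -1, { subst => \&with_subst, split => \&with_split } );
```

A negative count tells cmpthese to run each sub for that many CPU seconds rather than a fixed number of iterations, which avoids the "too few iterations" warning on fast code.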

    The second example above is fairly easy to tweak to give the sort of error messages you might expect out of a lexer, since it is kind of a degenerate case of one.

    #!perl
    use strict;
    use warnings;
    use 5.12.0;

    my $inside = 0;

    while ( <> ) {
        chomp;
        my $str = '';
        my @parts = split //;
        $inside = 0;
        for ( @parts ) {
            0 > $inside && $inside++ && warn "WARNING: extra close on line $.\n";
            /\(/ && $inside++ && next;
            /\)/ && $inside-- && next;
            $str .= $_ unless $inside > 0;
        }
        warn "WARNING: Unclosed parenthetical on line $.\n" if $inside;
        say $str;
    }

    THIS IS OUTSIDE
     outside
    before  after
    before  between  after
    b  a
    before  after
    WARNING: Unclosed parenthetical on line 7
    This one hangs
    This one has  an empty pair
    WARNING: extra close on line 9
    This opens ) and double closes
    this is the last  one

      You can get the same output as in the second example by slightly modifying the substitution in the first example and calling it in a loop:
      1 while s/\([^()]*\)//g;

      To get the warnings, just check for remaining ( or ).

      ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
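The loop-plus-check idea in this reply can be sketched as follows. This is a minimal sketch, not the poster's code: strip_parens is a hypothetical helper name, the warning texts are borrowed from the earlier example, and the sample inputs come from the data file. Note that, unlike the character-walking version, it leaves an unmatched parenthesis and the text after it in the result.

```perl
use strict;
use warnings;

# Remove innermost pairs repeatedly until none remain, then report leftovers.
# strip_parens is a hypothetical helper name, not from the thread.
sub strip_parens {
    my ($line) = @_;
    1 while $line =~ s/\([^()]*\)//g;    # each pass peels one nesting level
    my @warnings;
    push @warnings, 'Unclosed parenthetical' if $line =~ /\(/;
    push @warnings, 'extra close'            if $line =~ /\)/;
    return ( $line, @warnings );
}

for my $input (
    'before (within (nested)) after',
    'This one hangs (with an unmatched open',
    'This opens (with one)) and double closes',
) {
    my ( $clean, @warnings ) = strip_parens($input);
    print "$clean\n";
    print "WARNING: $_\n" for @warnings;
}
```

Because [^()]* cannot cross a parenthesis, each pass only removes groups that contain no other groups, so nested input needs one pass per nesting level.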

        My goal wasn't particularly to force them all to the same output. I've yet to be sure what output the OP wants beyond the single case of input. The goal was to demonstrate that there are multiple ways to do things, none of which is a single regular expression match. Thanks for helping expand on that.

      Are you sure you want a match, and that you can only use a single regex? I have bad news... a regular language has no context. ... Regexes will only solve a subproblem of your problem.

      from ISBN 9780596004927: Programming Perl (4th Ed), Pattern Matching - Chapter 5, p. 167

      If you're acquainted with regular expressions from some other venue, we should warn you that regular expressions are a bit different in Perl. First they aren't entirely "regular" in the theoretical sense of the word, which means they can do much more than the traditional regular expressions taught in Computer Science classes.
      Mentioned similarly in Wikipedia

      Brackets with Perl regex is a FAQ, Can I use Perl regular expressions to match balanced text?, and is also covered in Rosetta Code. The regex techniques used are relatively new(er) and somewhat advanced but other answers have discussed modules that hide the technique behind a simple interface.
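The newer regex technique referred to here is the recursive pattern. A minimal sketch, assuming Perl 5.10 or later and using the nested test string from haukex's reply; the variable names are illustrative only:

```perl
use strict;
use warnings;

# A balanced group: "(", then any mix of non-paren runs or nested groups, then ")".
# (?1) recurses into capture group 1, so this qr// object assumes it is used
# on its own and not interpolated after other capture groups.
my $balanced = qr/(\((?:[^()]++|(?1))*+\))/;

my $str = 'a(b(c(d)e)f)g(h(i(j)k)l)m';
(my $outside = $str) =~ s/$balanced//g;
print "$outside\n";    # prints "agm"
```

This handles nesting in a single substitution with no extra modules, at the cost of a pattern that is harder to read than the Regexp::Common or Text::Balanced calls shown in other replies.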

      Correction based on feedback from mr_mischief below:

      I also studied "regular languages"/"regular sets" in computer science and, while I understand both the confusion and the intention to avoid unneeded extensions, I worry that ignoring all extensions to the more formal concept of a regular expression will cause more confusion. The Wikipedia article mentions backreferences as an example of a very commonly used extension. I don't really see the difference covered in perlre, perlrequick or perlretut and would be interested in any suggestions on finding or adding the information for any of those documents.

      Ron

        I hand-waved over Perl regexes not being entirely regular. They are still not (without doing really convoluted things) a Turing-complete language. Matching the complement of the text that's bracketed is still a step beyond matching bracketed text. Just because one can hammer a screw does not make a hammer a screwdriver nor a screw a nail.

        I remember a time when helping people with questions that this community would actually try to offer advice on good, clean ways to do things. I'm unsure of the value of arguing that because something is strictly possible that we should help do things the wrong way.

      Hmm, lots of code and many things to learn. Thanks a LOT for your wonderful efforts.

Re: Regex for outside of brackets
by Anonymous Monk on Jul 13, 2018 at 11:58 UTC
    Another way to do it would be to apply a regex repeatedly to the string, a regex which looks for a balanced-parentheses group. Note the substring-position where that pattern begins, and calculate where it ends. Iterate through the loop, extracting the appropriate substrings given their now-known starting and ending positions. Source-code to do it left as an exercise for the reader.
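One possible reading of the exercise above, offered as a sketch rather than the poster's intended solution. It assumes a recursive pattern (Perl 5.10 or later) as the balanced-group regex, and outside_parts is a hypothetical helper name. The @- and @+ arrays supply the start and end offsets of each match.

```perl
use strict;
use warnings;

# outside_parts is a hypothetical helper. It visits each balanced group in turn
# and collects the substrings that lie between the groups.
sub outside_parts {
    my ($str) = @_;
    my @parts;
    my $last_end = 0;    # offset just past the previous group
    while ( $str =~ /(\((?:[^()]++|(?1))*+\))/g ) {
        my ( $start, $end ) = ( $-[0], $+[0] );
        push @parts, substr( $str, $last_end, $start - $last_end )
            if $start > $last_end;
        $last_end = $end;
    }
    # Whatever follows the last group is outside as well.
    push @parts, substr( $str, $last_end ) if $last_end < length $str;
    return @parts;
}

print join( '|', outside_parts('OUT1(IN)OUT2(IN(NESTED))OUT3') ), "\n";
# prints "OUT1|OUT2|OUT3"
```

Returning the pieces as a list, instead of one joined string, keeps the information about where each group was removed.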
