PerlMonks  

Regular expressions: Extracting certain text from a line

by Wcool (Novice)
on Apr 07, 2014 at 10:31 UTC ( #1081385=perlquestion )
Wcool has asked for the wisdom of the Perl Monks concerning the following question:

Hi Perl Monks,

A program I am working on parses text to extract anything enclosed in curly or square brackets. However, it should not extract {} or [] (empty enclosures).

For instance if the input line is:

EntityMappingFetchByName?[](EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$[] = [ USER_ENTITY_NAME ], text${} = { this is a test }), line 6

it should return "[ USER_ENTITY_NAME ]" and "{ this is a test }" only. When I use this pattern:

((\{.+?\})|(\[.+?\])) in a while loop

I get:

MATCH = [](EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$[]
MATCH = [ USER_ENTITY_NAME ]
MATCH = {} = { this is a test }

Whatever I try, the expression always seems to be greedy. How do I make the expression non-greedy and have it skip empty {} and []?

Thanks for your answers

Replies are listed 'Best First'.
Re: Regular expressions: Extracting certain text from a line
by Corion (Pope) on Apr 07, 2014 at 10:42 UTC

    You have \{.+?\}.

    From your description, you don't want [] within {...} and you don't want {} within [ ... ], so your "in between" groups should reflect that:

    /(\{[^\[\]{}]+?\})|.../
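One way the full alternation might look is sketched below. This is an assumption, not Corion's code: the elided `...` branch is taken to mirror the brace branch for square brackets, and `$line`, `$re` and `@matches` are illustrative names.

```perl
use strict;
use warnings;

# Sample line from the original question.
my $line = 'EntityMappingFetchByName?[](EntityName$ = DERIVED_ATTRIBUTE_TABLE, '
         . 'FieldNames$[] = [ USER_ENTITY_NAME ], text${} = { this is a test }), line 6';

# Assumption: the second branch mirrors the first one for square brackets.
# The negated class excludes all four bracket characters and needs at least
# one character, so '[]' and '{}' cannot match and a match cannot run
# across another bracket.
my $re = qr/(\{[^\[\]{}]+?\}|\[[^\[\]{}]+?\])/;

# In list context with /g, the match returns every capture of group 1.
my @matches = $line =~ /$re/g;
print "MATCH = $_\n" for @matches;
# MATCH = [ USER_ENTITY_NAME ]
# MATCH = { this is a test }
```

Because the class rejects every bracket character, this form only finds innermost, non-nested pairs.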

    I think there are ways to better extract stuff within matching pairs of parentheses, but in the long run, you'll have to look at a proper parser for your grammar.

      I used the <code> tag as the pseudo HTML is creating havoc on square brackets.

      Basically I want the outermost [ some chars ] or { some chars } but not [] or {}.

      2 other examples:

      1) a[] = [ this is a test { test2 } ]
      Should only match [ this is a test { test2 } ]

      2) a[] = [ this is a [ test ] { test2 } ]
      Should return [ this is a [ test ] { test2 } ]

      I simplified by looking only at square brackets but still no joy.

      I thought of something like this:

      \[      <- a square bracket
      .[^\/]  <- followed by any character but not an end bracket
      +       <- at least one character
      I give up. I will just look for brackets, and if it matches empty brackets I filter them out in the code.
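The filtering fallback described here could look roughly like the sketch below. It is only an illustration under assumptions: the quantifier is switched from `.+?` to `.*?` so that an empty pair matches on its own, and `@all` and `@wanted` are made-up names. It does not handle nested brackets.

```perl
use strict;
use warnings;

# Sample line from the original question.
my $line = 'EntityMappingFetchByName?[](EntityName$ = DERIVED_ATTRIBUTE_TABLE, '
         . 'FieldNames$[] = [ USER_ENTITY_NAME ], text${} = { this is a test }), line 6';

# '.*?' (zero or more) lets '[]' and '{}' match by themselves, so a lazy
# match can no longer start at an empty pair and run on to a later bracket.
my @all = $line =~ /(\{.*?\}|\[.*?\])/g;

# Drop the bare two-character pairs in code rather than in the pattern.
my @wanted = grep { length($_) > 2 } @all;

print "MATCH = $_\n" for @wanted;
# MATCH = [ USER_ENTITY_NAME ]
# MATCH = { this is a test }
```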

        Don't give up too quickly!

        (\[(?:[^\[\]]++|(?1))+\])

        That should get you started ;-)
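A minimal sketch of that recursive pattern at work, assuming Perl 5.10 or later (where `(?1)` and `++` are available). The two sample strings come from the examples earlier in this subthread; `@samples` and `@found` are illustrative names.

```perl
use strict;
use warnings;

# (?1) recurses into capture group 1, so a nested [ ... ] is consumed as
# part of the enclosing pair. The outer '+' demands at least one item
# between the brackets, which rules out a bare '[]'.
my $re = qr/(\[(?:[^\[\]]++|(?1))+\])/;

my @samples = (
    'a[] = [ this is a test { test2 } ]',
    'a[] = [ this is a [ test ] { test2 } ]',
);

my @found;
for my $s (@samples) {
    my @m = $s =~ /$re/g;    # list context: one capture per match
    push @found, @m;
    print "MATCH = $_\n" for @m;
}
# MATCH = [ this is a test { test2 } ]
# MATCH = [ this is a [ test ] { test2 } ]
```

This version covers square brackets only; curly braces would need a second pattern of the same shape.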

Re: Regular expressions: Extracting certain text from a line
by Discipulus (Monsignor) on Apr 07, 2014 at 10:46 UTC
    only a tip:

    try davido's precious regex test website
    or if you are on windows you can try regex-coach program

    HtH
    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks for the link, very good to debug Regex!
Re: Regular expressions: Extracting certain text from a line
by kcott (Chancellor) on Apr 08, 2014 at 06:17 UTC

    G'day Wcool,

    You get the output:

    MATCH = [](EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$[]

    because

    1. Regex '\[' matches character in line: '['.
    2. Regex '.+?' matches at least one character (']') and then continues to match non-greedily (i.e. up to but not including the next ']'), that's '(EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$['.
    3. Regex '\]' matches the ']' after that.
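The three steps above can be reproduced with a short script. This is a sketch that assumes the sample line and the pattern exactly as given in the question; `@got` is an illustrative name.

```perl
use strict;
use warnings;

# Sample line from the original question.
my $line = 'EntityMappingFetchByName?[](EntityName$ = DERIVED_ATTRIBUTE_TABLE, '
         . 'FieldNames$[] = [ USER_ENTITY_NAME ], text${} = { this is a test }), line 6';

# The pattern from the question, unchanged. '.+?' must consume at least one
# character, so after '[' it swallows the adjacent ']' and then extends
# lazily until the next ']' it can find.
my @got;
while ($line =~ /((\{.+?\})|(\[.+?\]))/g) {
    push @got, $1;
    print "MATCH = $1\n";
}
# MATCH = [](EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$[]
# MATCH = [ USER_ENTITY_NAME ]
# MATCH = {} = { this is a test }
```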

    So, instead of matching any character one or more times non-greedily (i.e. '.+?'), what you really want is to match any character that isn't ']' one or more times greedily (i.e. '[^\]]+').

    You get the output:

    MATCH = {} = { this is a test }

    for much the same reasons. The fix is similar, changing '.+?' to '[^}]+'.

    Also, unless you really want those extra captures, you can lose the two inner pairs of parentheses.

    Here's my test:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $line = 'EntityMappingFetchByName?[](EntityName$ = DERIVED_ATTRIBUTE_TABLE, FieldNames$[] = [ USER_ENTITY_NAME ], text${} = { this is a test }), line 6';

    my $re = qr< ( { [^}]+ } | \[ [^\]]+ \] ) >x;

    print "MATCH = $1" while $line =~ /$re/g;

    Output:

    MATCH = [ USER_ENTITY_NAME ]
    MATCH = { this is a test }

    -- Ken

        Grrr! I wish they wouldn't do that.

        Anticipating more ante upping, with deeply nested brace/bracket combos and wanting to capture a nested (but not an isolated) '{}' or '[]', e.g. '{ {} }', here's (maybe) a bit of a cheat:

        #!/usr/bin/env perl -l

        use strict;
        use warnings;

        my ($brace_re, $bracket_re);
        $brace_re = qr< { (?: [^{}]++ | (??{ $brace_re }) )* } >x;
        $bracket_re = qr< \[ (?: [^\[\]]++ | (??{ $bracket_re }) )* \] >x;

        my $re = qr< ( $brace_re | $bracket_re ) >x;

        while (<DATA>) {
            print;
            while (/$re/g) {
                print "MATCH = $1" if length $1 > 2;
            }
            print '-' x 60;
        }

        __DATA__
        ...?[](...$[] = [ USER_ENTITY_NAME ], text${} = { this is a test })...
        a[] = [ this is a [ test ] { test2 } ]
        a{} = { this is a { test } [ test2 ] }
        { a { b [ {}c{} ] d } e } = [ f [ g { []h[] } i ] j ]
        {}[]{ {}[] }[]{} - []{}[ []{} ]{}[]

        Output:

        ...?[](...$[] = [ USER_ENTITY_NAME ], text${} = { this is a test })...
        MATCH = [ USER_ENTITY_NAME ]
        MATCH = { this is a test }
        ------------------------------------------------------------
        a[] = [ this is a [ test ] { test2 } ]
        MATCH = [ this is a [ test ] { test2 } ]
        ------------------------------------------------------------
        a{} = { this is a { test } [ test2 ] }
        MATCH = { this is a { test } [ test2 ] }
        ------------------------------------------------------------
        { a { b [ {}c{} ] d } e } = [ f [ g { []h[] } i ] j ]
        MATCH = { a { b [ {}c{} ] d } e }
        MATCH = [ f [ g { []h[] } i ] j ]
        ------------------------------------------------------------
        {}[]{ {}[] }[]{} - []{}[ []{} ]{}[]
        MATCH = { {}[] }
        MATCH = [ []{} ]
        ------------------------------------------------------------

        Update: For Perl v5.8, you'll need to change [...]++ to (?> [...]+ ) (the '++' appeared in v5.10) and qr<...> delimiters will need to be something else, e.g. qr!...!.

        The '(??{ $re })' construct has been around since at least v5.8.8.

        Here's the perlre doco for 5.8.8 and 5.10.0.
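The v5.8 adjustments described in the update might look like the sketch below. It is a guess at the intended form rather than code from the post: the braces are escaped here as a precaution, the variable names `$line` and `@found` are illustrative, and it has not been run on an actual 5.8 installation.

```perl
use strict;
use warnings;

my ($brace_re, $bracket_re);

# (?> ... ) is the atomic group that the possessive '++' abbreviates.
# '!' delimiters avoid a clash between qr<...> and the '>' in '(?>'.
$brace_re   = qr! \{ (?: (?> [^{}]+ )   | (??{ $brace_re })   )* \} !x;
$bracket_re = qr! \[ (?: (?> [^\[\]]+ ) | (??{ $bracket_re }) )* \] !x;

my $re = qr! ( $brace_re | $bracket_re ) !x;

# One of the test lines from the script above.
my $line = '{ a { b [ {}c{} ] d } e } = [ f [ g { []h[] } i ] j ]';

# Keep only matches longer than a bare '{}' or '[]'.
my @all   = $line =~ /$re/g;
my @found = grep { length($_) > 2 } @all;

print "MATCH = $_\n" for @found;
# MATCH = { a { b [ {}c{} ] d } e }
# MATCH = [ f [ g { []h[] } i ] j ]
```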

        -- Ken

Re: Regular expressions: Extracting certain text from a line
by Anonymous Monk on Apr 08, 2014 at 11:00 UTC
    Thanks all for your replies. Very much appreciated. I didn't realize I upped the ante but it makes a big difference indeed if nested structures should be skipped.

    I will try all your input out, thanks again!

    Wcool

Node Type: perlquestion [id://1081385]
Approved by mtmcc
Front-paged by kcott