http://www.perlmonks.org?node_id=1061936

cadphile has asked for the wisdom of the Perl Monks concerning the following question:

I have what seems like a simple parsing problem, but am stumped. I have single input lines with repeated patterns of arbitrary length that match this general format (shown with 3 patterns):

$str = q!AND (random text) AND (more random text) AND (yet more)!;
$str = q!OR (random text) OR (more random text) OR (yet more)!;
I want to build a parser that loops through the input line and repeatedly extracts either the token "AND" or "OR", and then eats up everything in the line until the next "AND" or "OR", or EOL whichever comes first. I've tried this:
while ($str =~ /\G(AND|OR)\s+(.+?)/g) { printf("%s %s\n", $1, $2); }
but that doesn't work -- the 2nd pattern eats everything to the EOL. Since the 2nd pattern is bounded by parens, I've tried this:
while ($str =~ /\G(AND|OR)\s+(\(.+?\))/g) { printf("%s %s\n", $1, $2); }
This does work, however, if the "random text" includes any parens itself, then the pattern match fails, by matching the first end-parenthesis in the text. E.g. this string breaks the pattern match:

$str = q!AND (random text) AND (yet (more))!;

What's a good way to eat up the line, capturing the "AND" or "OR" token in $1 and the "random text" in $2, on and on until we hit the EOL? Your regex expertise is appreciated.

Replies are listed 'Best First'.
Re: simple perl regex question (or is it?)
by LanX (Saint) on Nov 11, 2013 at 00:02 UTC
    I think you need to anchor the end not the start of your match.

    something like /(.*?)(AND|OR|$)/g

    if you really need the preceding AND|OR do a positive look ahead

    /(AND|OR)(.+?)(?=AND|OR|$)/g

    sorry can't test ATM.

    (Hopefully you don't expect this to work with nested ANDs and ORs within the parens?)

    update

    looks good for me HTH

    DB<100> $str = q!AND (random text) AND (more random text) AND (yet more)!;
    => "AND (random text) AND (more random text) AND (yet more)"
    DB<101> print "$1: $2\n" while $str =~ /(AND|OR)(.+?)(?=AND|OR|$)/g
    AND: (random text)
    AND: (more random text)
    AND: (yet more)
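    One caveat with the look-ahead version, offered as a hedged sketch and not as something tested in this thread: without word boundaries, an uppercase "AND" or "OR" inside a longer word (say "SANDY" or "ORDER") would also stop the match. Adding \b around the alternation is one possible guard. The test string and the @pairs array below are made up for illustration, and a standalone uppercase AND or OR inside the text would still split the match.

```perl
use strict;
use warnings;

# Hypothetical input: "SANDY" and "ORDER" contain AND / OR as substrings.
my $str = q!AND (SANDY ORDER) OR (more random text)!;

my @pairs;
while ( $str =~ /\b(AND|OR)\b\s*(.+?)\s*(?=\b(?:AND|OR)\b|$)/g ) {
    # $1 is the operator, $2 the text up to the next whole-word operator
    push @pairs, [ $1, $2 ];
}

print "$_->[0]: $_->[1]\n" for @pairs;
```

    The \s* on either side of (.+?) is only there to trim the captured text; it is a convenience, not part of the technique.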

    update

    if you can deal with leading empty strings try split

    DB<103> @matches = split /(AND|OR)/,$str
    => (
      "",
      "AND",
      " (random text) ",
      "AND",
      " (more random text) ",
      "AND",
      " (yet more)",
    )
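    Because the separator pattern is in a capturing group, split returns the AND/OR tokens along with the text, so they can be paired up afterwards. A minimal sketch, assuming Perl 5.10+ for the // operator; the \b anchors, the @pairs array and the mixed AND/OR test string are illustrative additions, not part of the session above.

```perl
use strict;
use warnings;

my $str = q!AND (random text) OR (more random text) AND (yet (more))!;

# The capturing group makes split return the separators as well.
my @fields = split /\b(AND|OR)\b/, $str;
shift @fields if @fields && $fields[0] eq '';    # drop the leading empty field

# Pair each operator with the text that follows it.
my @pairs;
while ( my ( $op, $text ) = splice @fields, 0, 2 ) {
    $text //= '';                  # an operator at end of line has no text
    $text =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
    push @pairs, [ $op, $text ];
}

print "$_->[0]: $_->[1]\n" for @pairs;
```

    Like the look-ahead regex, this splits on any whole-word AND or OR, including ones nested inside the parentheses.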

    Cheers Rolf

    (addicted to the Perl Programming Language)

      Hi Rolf, thanks for your quick reply. The look-ahead did the trick:
      $ perl -le '$str = q{AND (random text) OR (more random text) AND (yet (more))}; print "$1: $2" while ($str =~ /(AND|OR)\s+(.+?)(?=AND|OR|$)/g)'
      AND: (random text)
      OR: (more random text)
      AND: (yet (more))
      In this exercise, I don't know in advance whether $1 will match "AND" or "OR", so I do want that value captured, which is why I didn't want to use split.

      Funny that you should ask about embedded nested ANDs or ORs. The system I'm working on would benefit from this, but I'd have to deal with balanced parentheses and probably use recursion to parse the entire structure. Assuming I could refine the code to do this accurately, I'd probably just end up hoist by my own petard...

      Many thanks! -Harry

        ... embedded nested ANDs or ORs. The system I'm working on would benefit from this ...

        Easily enough done with an approach like that of Re: simple perl regex question (or is it?) (and probably others as well): just take whatever you get from the $nested_parens portion of the match (with or without the enclosing ()s) and run it through the main regex again! Repeat until there is no match, i.e., no nested AND|OR(...) to be found.
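        A rough sketch of that idea, assuming Perl 5.10+ and reusing the $intro and $nested_parens patterns from the (?PARNO)-based reply in this thread. The helper names parse_terms and dump_terms, the hash layout, the \b anchors and the test string are all hypothetical, chosen only for illustration.

```perl
use 5.010;
use strict;
use warnings;

my $intro         = qr{ \b (?: AND | OR ) \b }xms;
my $nested_parens = qr{ ( \( ( (?: [^()]*+ | (?-2) )* ) \) ) }xms;

# Hypothetical helper: returns an array ref of { op, text, children }.
sub parse_terms {
    my ($text) = @_;
    my @terms;
    while ( $text =~ m{ ($intro) \s* $nested_parens }xmsg ) {
        # Copy the captures before recursing; $3 is the text without the outer ()s.
        my ( $op, $inner ) = ( $1, $3 );
        push @terms, {
            op       => $op,
            text     => $inner,
            children => parse_terms($inner),    # run the same regex on the inside
        };
    }
    return \@terms;
}

# Hypothetical helper: print the tree, indenting one level per nesting depth.
sub dump_terms {
    my ( $terms, $depth ) = @_;
    for my $t (@$terms) {
        say '  ' x $depth, "$t->{op}: $t->{text}";
        dump_terms( $t->{children}, $depth + 1 );
    }
}

my $tree = parse_terms(q{AND (a OR (b) OR (c AND (d))) OR (e)});
dump_terms( $tree, 0 );
```

        The recursion stops on its own: a term whose text contains no further AND|OR(...) yields an empty children list.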

Re: simple perl regex question (or is it?)
by ww (Archbishop) on Nov 11, 2013 at 00:12 UTC
    #!/usr/bin/perl
    use 5.016;
    use warnings;
    #1061936

    my $str  = q!AND (random text) AND (more random text) AND (yet more)!;
    my $str2 = q!OR (random text) OR (more random text) OR (yet more)!;

    while ( $str =~ /(AND )(\(.*\))?( AND|$)/ig ) {
        say $2;
    }
    while ( $str2 =~ /(OR )(\(.*\))?( OR|$)/ig ) {
        say "Rowing: $2";
    }

    =head
    C:\>D:\1061936.pl
    (random text) AND (more random text) AND (yet more)
    Rowing: (random text) OR (more random text) OR (yet more)
    =cut

    Combining the regexen into a single regex left as a learning opportunity for OP.... except that whilst I posted this, LanX has taken care of the matter. :-(

    As to the paren problem, run this after modifying a string with nested parens; then see regex docs re escaping.

    Last para is an afterthought/update.

Re: simple perl regex question (or is it?)
by hdb (Monsignor) on Nov 11, 2013 at 09:13 UTC

    My preference would be to use split. I would split on patterns like ") AND (" and ") OR (" which would guard against capitalized AND or OR appearing in your random string and remove the parentheses as well.

    use strict;
    use warnings;
    use Data::Dumper;

    my $str1 = q!AND (random text) AND (more random text) AND (yet more)!;
    my $str2 = q!OR (random text) OR (more random text) OR (yet (more))!;

    for ( $str1, $str2 ) {
        my @texts = grep $_    # filter out empty strings
            , split /
                (?:\)\s|^)        # either closing parentheses or start of string
                (?:AND|OR)\s\(    # followed by AND or OR and opening parentheses
                |                 # or
                \)$               # closing parentheses at end of string
            /x;
        print Dumper \@texts;
    }

Re: simple perl regex question (or is it?)
by AnomalousMonk (Archbishop) on Nov 11, 2013 at 15:46 UTC

    This is almost a copy of the example from the "(?PARNO)" discussion in Extended Patterns in perlre. It needs Perl version 5.10+. Note there is an extraneous ) at the end of the $s test string.

    >perl -wMstrict -le
    "use 5.010;
     ;;
     my $s = q{xx AND(random text) xx OR(more (and more)) AND (or (more (too))))};
     ;;
     my $intro = qr{ AND | OR }xms;
     my $nested_parens = qr{ ( \( ( (?: [^()]*+ | (?-2))* ) \) ) }xms;
     ;;
     while ($s =~ m{ ($intro) \s* $nested_parens }xmsg) {
       print qq{\$1 '$1'  \$2 '$2'  \$3 '$3'};
       }
    "
    $1 'AND'  $2 '(random text)'  $3 'random text'
    $1 'OR'  $2 '(more (and more))'  $3 'more (and more)'
    $1 'AND'  $2 '(or (more (too)))'  $3 'or (more (too))'

    Update: Changed
        printf qq{\$1 '$1'  \$2 '$2'  \$3 '$3' \n}, $1, $2, $3;
    in while-loop to something that makes more sense.