RegEx - Positive Look-ahead

tmharish has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following to extract the data within a {{Infobox ... }} block; the catch is that there may be nested {{ ... }} blocks inside it:

use strict   ;
use warnings ;

use Data::Dump qw( dump ) ;


my @data = <DATA> ;
my $data = join( '', @data ) ;

my @info_box_contents ;
while( $data =~ m/(
                     {{Infobox
                         .*?
                         ({{(?=}}))?
                         .*?
                      }}
                     )/xsg ) {
    print STDERR "MATCHED\n";
    push @info_box_contents, $1 ;
}

dump( \@info_box_contents ) ;




__DATA__
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
}}
blah blah 
blah blah 
blah blah 
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
{{Infobox one}}

I get the following output:

MATCHED
MATCHED
MATCHED
[
  "{{Infobox\n text text text \n {{text text text \n text {{text text}}",
  "{{Infobox\n text1 text1 text1 \n {{text1 text1 text1 \n text1 {{text1 text1}}",
  "{{Infobox one}}",
]

I am expecting it to match the entire block, up to the first mismatched '}}'.

What's more, if I remove the '?' after my look-ahead group, changing that line from ({{(?=}}))? to ({{(?=}})), I match nothing.

Help will be greatly appreciated.
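
For reference, a look-ahead is a zero-width assertion: `{{(?=}})` can only succeed where `{{` is immediately followed by `}}`, and that never happens in the sample data. That would explain why the pattern matches nothing once the group is made mandatory, and why, with the `?` in place, the group is simply skipped and the lazy `.*?` stops at the first `}}`. A minimal sketch, using made-up sample strings rather than the data above:

```perl
use strict;
use warnings;

# Hypothetical sample strings, not taken from the original data.
my @samples = ( '{{}}', '{{text}}' );

my @captured;
for my $sample (@samples) {
    # '{{' must be followed immediately by '}}';
    # the look-ahead checks for it but consumes nothing.
    if ( $sample =~ / ( \{\{ (?= \}\} ) ) /x ) {
        push @captured, $1;
        print "'$sample' matches, captured '$1'\n";
    }
    else {
        push @captured, undef;
        print "'$sample' does not match\n";
    }
}
```

Only the first string matches, and the capture holds just the two opening braces, since the closing pair is asserted but not consumed.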


Replies are listed 'Best First'.
Re: RegEx - Positive Look-ahead by 7stud (Deacon) on Feb 05, 2013 at 20:28 UTC
Is this what you want???

use strict;
use warnings;
use 5.012;

use Text::Balanced qw( extract_tagged extract_multiple );

my $text = <<'END_OF_STRING';
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
}}
blah blah 
blah blah 
blah blah 
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
{{Infobox one}}
END_OF_STRING

my @infoboxes = extract_multiple(
    $text,
    [ \&my_extractor ],
    undef,
    1
);

sub my_extractor {
    extract_tagged(
        $text,
        "{{",
        "}}",
    );
}

for my $infobox (@infoboxes) {
    say $infobox;
    say '*' x 20;
}

--output:--
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
}}
********************
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
********************
{{Infobox one}}
********************

Here's the same result using regexes via Regexp::Common:

use strict;
use warnings;
use 5.012;

use Regexp::Common qw( balanced );

my $text = <<'END_OF_STRING';
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
}}
blah blah 
blah blah 
blah blah 
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
{{Infobox one}}
END_OF_STRING

my $pattern = $RE{ balanced } { -begin => '{{' } { -end => '}}' };

while ($text =~ /($pattern)/gxms) {
    say $1;
    say '*' x 20;
}

--output:--
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
}}
********************
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
********************
{{Infobox one}}
********************
Re^2: RegEx - Positive Look-ahead by tmharish (Friar) on Feb 06, 2013 at 07:35 UTC
Works like a charm - Thank you very much. Also this was really helpful and this has been noted.
Re^2: RegEx - Positive Look-ahead by tmharish (Friar) on Feb 07, 2013 at 14:26 UTC
7stud, considering your other post (which might or might not have stemmed from this), I thought I would update this thread with the final solution that I used (also for anyone else who might care).

I found that, since `{{Infobox` was not the only chunk I needed, I was taking a huge performance hit. To avoid this I changed to a single sweep of the (long) text chunk, as follows. I have removed the other parts that I extract in the same sweep so as to stick to the OP topic.

use strict ;
use warnings ;

use Data::Dump qw( dump ) ;

my $text = <<'END_OF_STRING';
{{Infobox
 text text text 
 {{text text text 
 text {{text text}}
 text}} 
 {{{text {{text }} 
    text }}}
END}}
blah blah 
blah blah 
blah blah 
{{Infobox
 text1 text1 text1 
 {{text1 text1 text1 
 text1 {{text1 text1}}
 text1}} 
 {{{text1 {{text1 }} 
    text1 }}}
}}
{{Infobox one}}
END_OF_STRING

my $box_contents = _get_info_boxes( $text ) ;
dump( $box_contents ) ;

exit;

sub _get_info_boxes {
    my $text = shift ;

    my @info_box_contents ;
    my $in_info_box ;
    my $this_info_box_content = "" ;
    my $bracket_count         = 0 ;

    foreach my $line ( split( /\n/, $text ) ) {
        # Skip lines until an infobox opens.
        unless( $in_info_box ) {
            next unless( $line =~ /\{\{Infobox/ ) ;
            $in_info_box = 1 ;
        }

        $this_info_box_content .= $line . "\n" ;

        # tr/// with an empty replacement list only counts characters.
        my $open_count  = ( $line =~ tr/{// ) ;
        my $close_count = ( $line =~ tr/}// ) ;
        $bracket_count  = $bracket_count + $open_count - $close_count ;

        # All braces opened so far are closed: the infobox is complete.
        if( $bracket_count == 0 ) {
            push @info_box_contents, $this_info_box_content ;
            $this_info_box_content = "" ;
            $in_info_box           = 0 ;
            $bracket_count         = 0 ;
        }
    }

    return \@info_box_contents ;
}
Re: RegEx - Positive Look-ahead by sundialsvc4 (Abbot) on Feb 05, 2013 at 13:39 UTC
To me, this is the sort of problem that should be described in terms of a grammar, and then handled using Parse::RecDescent. (I have used that module quite extensively and it works very well.)

Let me give you one small tip if you go that way... Part of your code will consist of the grammar and Perl code that will be compiled on-the-fly into it. This is a great place to put a `use` statement(s) that will link to any subroutines that you find that you need to make use of in the grammar code.

Anyway, the advantage of this approach is that you describe the structure of the language being processed and let the parser do the magic. It takes a bit of practice to get the grammar right, heh, but the parser can handle the mechanics of regex’ing and backtracking so that your code doesn’t have to. You can extend it to do more things without getting buried.
Re^2: RegEx - Positive Look-ahead by tmharish (Friar) on Feb 05, 2013 at 14:46 UTC
Thanks, sundialsvc4. I am going through the documentation of that module and will reply once I have completed that.
Re^3: RegEx - Positive Look-ahead by 7stud (Deacon) on Feb 05, 2013 at 17:30 UTC
I just had my first run-in with Parse::RecDescent, and after I figured out the basics from the docs, I posted some beginner tips here.
Re: RegEx - Positive Look-ahead by 7stud (Deacon) on Feb 05, 2013 at 18:42 UTC
This regex: `/[aaaaa]/` is equivalent to: `/[a]/`

So this regex: `[^{{}}]` is equivalent to: `[^{}]`

"I am expecting it to match the entire block up to the first mismatched '}}'"

Huh? First, you said this:

"I am using the following to extract data with a {{Infobox ... }} block, with the catch being that there might be {{ ... }} blocks within it."

Then you said this:

"The problem is that this does not match if I have {{ something {{{ text }}} }} in my content ..."

Here is a regex that will 'match' both: `/./`

To ask a regex question that is specific enough to get a relevant answer, you need to post:

1) An example of your text.
2) The exact text you want to end up with.

Note that for 2), you DO NOT post a description of the text that you want to end up with. Why? Because descriptions are usually gibberish, and at best they are subject to different interpretations.

If required, repeat steps 1) and 2) as many times as needed to highlight the twists and turns in the text you need to match.
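
The character-class point is easy to check: a bracketed class is a set of characters, so repeating a character inside it changes nothing. A minimal sketch, with sample strings made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical test strings; any strings would do.
my @strings = ( 'abc', '{', '}', '{{', '}}', 'a{b' );

my $doubled = qr/[^{{}}]/;    # class with repeated braces
my $single  = qr/[^{}]/;      # the same set, written once

my $same = 1;
for my $string (@strings) {
    my $hit_doubled = $string =~ $doubled ? 1 : 0;
    my $hit_single  = $string =~ $single  ? 1 : 0;
    $same = 0 if $hit_doubled != $hit_single;
    print "'$string': doubled=$hit_doubled single=$hit_single\n";
}
print $same ? "Both classes agree\n" : "Classes differ\n";
```

Neither class can express "not the two-character sequence {{", which is why the recursive pattern further down has trouble with runs of three braces.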
Re: RegEx - Positive Look-ahead by Anonymous Monk on Feb 05, 2013 at 13:54 UTC
I'm not clear-headed enough at the moment to figure out your regex, but you should check out these:

- perlfaq -> perlfaq6 -> Can I use Perl regular expressions to match balanced text?
- perlfaq6#What good is \G in a regular expression?
- Re: print output from the text file. (marpa scanless dsl calculator)
- Re^2: Help with regular expression ( m/\G/gc )
Re^2: RegEx - Positive Look-ahead by tmharish (Friar) on Feb 05, 2013 at 14:45 UTC
Thank You! I modified my RegEx like so:

(                   # start of capture group 1
    {{              # match an opening
    (?:
        [^{{}}]++   # one or more, non backtracking
      |
        (?1)        # found {{ or }}, so recurse to capture group 1
    )*
    }}              # match a closing
)                   # end of capture group 1

The problem is that this does not match if I have {{ something {{{ text }}} }} in my content ...
Re^3: RegEx - Positive Look-ahead by Anonymous Monk on Feb 05, 2013 at 14:50 UTC
If you want to match both {{{ and {{, maybe you can make it:

[{]{2,3}    # opener
[}]{2,3}    # closer
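
Another way around the {{{ ... }}} case, offered only as a sketch and not as what anyone in this thread used: recurse on single braces instead of pairs, and use a look-ahead to start only where `{{Infobox` begins. This assumes braces in the text are balanced one-for-one (so `{{{ ... }}}` is treated as three nested pairs); a stray single brace inside an infobox would defeat it. The sample text below is a shortened, made-up variant of the data in the original post.

```perl
use strict;
use warnings;
use 5.010;    # recursion (?1) and possessive quantifiers need Perl 5.10+

# Shortened, hypothetical sample in the style of the original data.
my $text = <<'END_OF_TEXT';
{{Infobox
 text {{text {{text}} text}}
 {{{text {{text }} text }}}
}}
blah blah
{{Infobox one}}
END_OF_TEXT

my $infobox = qr/
    (?= \{\{Infobox )   # only start where an infobox opens
    (                   # capture group 1: one balanced brace pair
        \{
        (?:
            [^{}]++     # a run of non-brace characters, no backtracking
          |
            (?1)        # or a nested balanced pair
        )*
        \}
    )
/x;

my @boxes;
while ( $text =~ /$infobox/g ) {
    push @boxes, $1;
}

say scalar(@boxes), " infobox(es) found";
say "---\n$_" for @boxes;
```

Because the outermost `{` of `{{Infobox` only closes at the very last `}`, each capture runs to the end of the whole block rather than stopping at the first `}}`.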
Re: RegEx - Positive Look-ahead by 7stud (Deacon) on Feb 08, 2013 at 03:08 UTC
"Considering your other post ( which might or might not have stemmed from this"

Yes it did. I've been working on another solution...one using Parse::RecDescent, which I finally finished.

"I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit."

Can you briefly describe why?

"To avoid this I changed to a single sweep of the ( long ) text chunk as follows"

And of course, you can count your own brackets--I figured that is what Text::Balanced does.

I'll offer up my Parse::RecDescent solution for the experts to comment on. After I got my grammar to match the text, I wanted to preserve the formatting of the original text, so I used an approach that records the positions in the text where the start and end of an infobox were found. Then I used substr() on the original text. I also thought it might help someone to see all the shenanigans I had to go through to check if my grammar matched. The grammar with all the actions I employed follows the finished program.

use strict;
use warnings;
use 5.012;

use Parse::RecDescent;

$::RD_ERRORS = 1;    #Parser dies when it encounters an error
$::RD_WARN   = 1;    #Enable warnings - warn on unused rules &c.
$::RD_HINT   = 1;    #Give out hints to help fix problems.
#$::RD_TRACE = 1;    #Trace parsers' behaviour

my $text = <<'END_OF_TEXT';
{{Infobox
 aaa bbb ccc
 {{ddd eee fff
 ggg {{ hhh iii}}
 jjj}}
 {{{kkk {{lll}}
    mmm }}}
}}
no no no
no no no
no no no
{{Infobox
 aaa2 bbb2 ccc2
 {{ddd2 eee2 fff2
 ggg2 {{hhh2 iii2}}
 jjj2}}
 {{{kkk2 {{lll2 }}
    mmm2 }}}
}}
{{Infobox 111}}
END_OF_TEXT

#Declare a global variable that can be loaded with
#data from inside the parser:
our @infobox_offsets;

my $grammar = <<'END_OF_GRAMMAR';
    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            push @main::infobox_offsets,
                 $itempos[1]->{offset}{from},
                 $itempos[3]->{offset}{to},
            ;
        }

    inner_block: brace_block | word(s)

    #Declare some my variables ('rulevars') for this rule:
    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"

    word: m{ [^{}]+ }xms

    lbrace: / [{] /xms

END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar!\n";

defined $parser->startrule($text)
    or die "Can't match text";

#Using the recorded offsets for the infoboxes
#print out the infobox substr()'s:

my ( $start_infobox, $end_infobox, $length_infobox );

while (@infobox_offsets) {
    $start_infobox  = shift @infobox_offsets;
    $end_infobox    = shift @infobox_offsets;
    $length_infobox = 1 + $end_infobox - $start_infobox;

    say '*' x 20;
    say substr
        $text,
        $start_infobox,
        $length_infobox,
    ;
    say '*' x 20;
}

--output:--
********************
{{Infobox
 aaa bbb ccc
 {{ddd eee fff
 ggg {{ hhh iii}}
 jjj}}
 {{{kkk {{lll}}
    mmm }}}
}}
********************
********************
{{Infobox
 aaa2 bbb2 ccc2
 {{ddd2 eee2 fff2
 ggg2 {{hhh2 iii2}}
 jjj2}}
 {{{kkk2 {{lll2 }}
    mmm2 }}}
}}
********************
********************
{{Infobox 111}}
********************

Here is my grammar with the additional actions that I used to test that the grammar matched the text:

my $grammar = <<'END_OF_GRAMMAR';
    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox
        {
            say Dumper(\@item);
        }
        | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            my $inner_blocks = join "", @{$item[2]};
            $return = join "\n", $item[1], $inner_blocks, $item[4];
        }

    inner_block: brace_block
        | word(s)
        {
            $return = join "\n", @{$item[1]} ;
        }

    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"
        {
            #say Dumper(\@item);
            my $inner_blocks = join "", @{$item[3]};
            $return = "$lbraces $inner_blocks $rbraces";
        }

    word: m{ [^{}]+ }xms

    lbrace: / [{] /xms

END_OF_GRAMMAR

--output:--
$VAR1 = [ 'paragraph', '{{Infobox aaa bbb ccc {{ ddd eee fff ggg {{ hhh iii }}jjj }}{{{ kkk {{ lll }}mmm }}} ' ];
$VAR1 = [ 'paragraph', '{{Infobox aaa2 bbb2 ccc2 {{ ddd2 eee2 fff2 ggg2 {{ hhh2 iii2 }}jjj2 }}{{{ kkk2 {{ lll2 }}mmm2 }}} ' ];
$VAR1 = [ 'paragraph', '{{Infobox 111 ' ];
Re^2: RegEx - Positive Look-ahead by tmharish (Friar) on Feb 12, 2013 at 14:10 UTC
"Can you briefly describe why?"

I am parsing Wikipedia dumps; one such file is this ( WARNING: 41MB file ). I rewrote my entire parser to sift through the file one line at a time, which turns out to be much faster than loading up chunks and running a RegEx over them.
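
The line-at-a-time approach can be sketched roughly as follows. This is not the actual parser: `read_info_boxes` is a hypothetical name, the sample text is made up, and an in-memory filehandle stands in for the real dump file. It reuses the brace-counting idea from the earlier reply, so it likewise assumes each infobox starts on its own line and that braces balance.

```perl
use strict;
use warnings;

# Hypothetical helper: collect {{Infobox ... }} blocks from any
# filehandle, reading one line at a time instead of slurping the file.
sub read_info_boxes {
    my ($fh) = @_;

    my @boxes;
    my $current;       # undef while we are outside an infobox
    my $depth = 0;     # open braces minus close braces so far

    while ( my $line = <$fh> ) {
        if ( !defined $current ) {
            next if index( $line, '{{Infobox' ) < 0;
            $current = '';
        }

        $current .= $line;

        # tr/// with an empty replacement list only counts characters.
        $depth += ( $line =~ tr/{// ) - ( $line =~ tr/}// );

        if ( $depth <= 0 ) {    # every opened brace has been closed
            push @boxes, $current;
            undef $current;
            $depth = 0;
        }
    }

    return \@boxes;
}

# Made-up sample; a real run would open the dump file instead.
my $sample = "{{Infobox\n a {{b}}\n {{{c {{d}} }}}\n}}\nblah blah\n{{Infobox one}}\n";

open my $fh, '<', \$sample or die "Cannot open in-memory handle: $!";
my $boxes = read_info_boxes($fh);
close $fh;

printf "%d infobox(es) found\n", scalar @$boxes;
```

Only the current infobox is held in memory at any time, which is the main attraction for a file of that size.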
