http://www.perlmonks.org?node_id=1017749


in reply to RegEx - Positive Look-ahead

Considering your other post ( which might or might not have stemmed from this

Yes it did. I've been working on another solution...one using Parse::RecDescent, which I finally finished.

I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit.

Can you briefly describe why?

To avoid this I changed to a single sweep of the ( long ) text chunk as follows

And of course, you can count your own brackets--I figured that is what Text::Balanced does.
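Counting the brackets by hand might look something like the sketch below. It is not taken from Text::Balanced or from anyone's code in this thread; the sub name find_infoboxes and the sample string are made up for illustration. It assumes every single { or } counts as one nesting level (so {{{ ... }}} nests three deep) and returns start/length pairs that can be passed to substr().

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return [start, length] pairs
# for every balanced '{{Infobox ... }}' chunk, found by counting braces.
sub find_infoboxes {
    my ($text) = @_;
    my @spans;
    my $from = 0;

    while ( ( my $start = index( $text, '{{Infobox', $from ) ) >= 0 ) {
        my $depth = 0;
        my $end;

        # Walk brace by brace from the opening '{{' until the depth
        # drops back to zero.
        pos($text) = $start;
        while ( $text =~ /([{}])/g ) {
            $depth += $1 eq '{' ? 1 : -1;
            if ( $depth == 0 ) {
                $end = pos($text);
                last;
            }
        }

        last unless defined $end;    # unbalanced braces: stop looking
        push @spans, [ $start, $end - $start ];
        $from = $end;
    }

    return @spans;
}

# Usage: print each infobox exactly as it appears in the sample string.
my $sample = 'text {{Infobox a {{b {{{c}}} }} }} more {{Infobox 1}}';
for my $span ( find_infoboxes($sample) ) {
    my ( $start, $length ) = @$span;
    print substr( $sample, $start, $length ), "\n";
}
```

Because the sub only records offsets, the original formatting inside each infobox is left untouched.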

I'll offer up my Parse::RecDescent solution for the experts to comment on. After I got my grammar to match the text, I wanted to preserve the formatting of the original text, so I used an approach that records the positions in the text where the start and end of each infobox were found. Then I used substr() on the original text.

I also thought it might help someone to see all the shenanigans I had to go through to check if my grammar matched. The grammar with all the actions I employed follows the finished program.

use strict;
use warnings;
use 5.012;

use Parse::RecDescent;

$::RD_ERRORS = 1;    #Parser dies when it encounters an error
$::RD_WARN   = 1;    #Enable warnings - warn on unused rules &c.
$::RD_HINT   = 1;    #Give out hints to help fix problems.
#$::RD_TRACE = 1;    #Trace parsers' behaviour

my $text = <<'END_OF_TEXT';
{{Infobox aaa bbb ccc
{{ddd eee fff ggg {{ hhh iii}} jjj}}
{{{kkk {{lll}} mmm }}}
}}
no no no no no no no no no
{{Infobox aaa2 bbb2 ccc2
{{ddd2 eee2 fff2 ggg2 {{hhh2 iii2}} jjj2}}
{{{kkk2 {{lll2 }} mmm2 }}}
}}
{{Infobox 111}}
END_OF_TEXT

#Declare a global variable that can be loaded with
#data from inside the parser:
our @infobox_offsets;

my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            push @main::infobox_offsets,
                $itempos[1]->{offset}{from},
                $itempos[3]->{offset}{to},
            ;
        }

    inner_block: brace_block | word(s)

    #Declare some my variables ('rulevars') for this rule:
    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"

    word: m{ [^{}]+ }xms
    lbrace: / [{] /xms

END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar!\n";

defined $parser->startrule($text)
    or die "Can't match text";

#Using the recorded offsets for the infoboxes
#print out the infobox substr()'s:
my ( $start_infobox, $end_infobox, $length_infobox );

while (@infobox_offsets) {
    $start_infobox  = shift @infobox_offsets;
    $end_infobox    = shift @infobox_offsets;
    $length_infobox = 1 + $end_infobox - $start_infobox;

    say '*' x 20;
    say substr
        $text,
        $start_infobox,
        $length_infobox,
    ;
    say '*' x 20;
}

--output:--
********************
{{Infobox aaa bbb ccc
{{ddd eee fff ggg {{ hhh iii}} jjj}}
{{{kkk {{lll}} mmm }}}
}}
********************
********************
{{Infobox aaa2 bbb2 ccc2
{{ddd2 eee2 fff2 ggg2 {{hhh2 iii2}} jjj2}}
{{{kkk2 {{lll2 }} mmm2 }}}
}}
********************
********************
{{Infobox 111}}
********************

Here is my grammar with the additional actions that I used to test that the grammar matched the text:

my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox
        {
            say Dumper(\@item);
        }
      | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            my $inner_blocks = join "", @{$item[2]};
            $return = join "\n", $item[1], $inner_blocks, $item[4];
        }

    inner_block: brace_block
      | word(s)
        {
            $return = join "\n", @{$item[1]} ;
        }

    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"
        {
            #say Dumper(\@item);
            my $inner_blocks = join "", @{$item[3]};
            $return = "$lbraces $inner_blocks $rbraces";
        }

    word: m{ [^{}]+ }xms
    lbrace: / [{] /xms

END_OF_GRAMMAR

--output:--
$VAR1 = [
          'paragraph',
          '{{Infobox aaa bbb ccc {{ ddd eee fff ggg {{ hhh iii }}jjj }}{{{ kkk {{ lll }}mmm }}} '
        ];

$VAR1 = [
          'paragraph',
          '{{Infobox aaa2 bbb2 ccc2 {{ ddd2 eee2 fff2 ggg2 {{ hhh2 iii2 }}jjj2 }}{{{ kkk2 {{ lll2 }}mmm2 }}} '
        ];

$VAR1 = [
          'paragraph',
          '{{Infobox 111 '
        ];

Re^2: RegEx - Positive Look-ahead
by tmharish (Friar) on Feb 12, 2013 at 14:10 UTC
    Can you briefly describe why?

    I am parsing Wikipedia dumps; one such file is this ( WARNING: 41MB file ). I rewrote my entire parser to sift through the file one line at a time, which turns out to be much faster than loading up chunks and using RegEx on them.
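A line-at-a-time pass of that kind could be sketched roughly as follows. This is not the poster's actual parser: the sub name infoboxes_by_line and the sample data are hypothetical, and it assumes an infobox always ends at the end of a line, so anything after the closing braces on that line is kept with it. It makes no claim about speed; it only shows the bookkeeping.

```perl
use strict;
use warnings;

# Hypothetical helper (not the poster's code): read a filehandle one line
# at a time and return each '{{Infobox ... }}' chunk as a string.
# An infobox still open at end of file is silently dropped.
sub infoboxes_by_line {
    my ($fh) = @_;
    my ( @found, $chunk );
    my $depth = 0;

    while ( my $line = <$fh> ) {
        if ( !defined $chunk ) {

            # Not inside an infobox yet: look for the opening marker.
            my $at = index $line, '{{Infobox';
            next if $at < 0;
            $line  = substr $line, $at;
            $chunk = '';
        }

        $chunk .= $line;

        # tr/// with an empty replacement only counts characters.
        $depth += ( $line =~ tr/{// ) - ( $line =~ tr/}// );

        if ( $depth <= 0 ) {    # braces balanced: infobox complete
            push @found, $chunk;
            undef $chunk;
            $depth = 0;
        }
    }

    return @found;
}

# Usage with an in-memory filehandle standing in for a dump file.
my $dump = "intro\n{{Infobox a\n| x = {{b}}\n}}\nno no\n{{Infobox 1}}\n";
open my $in, '<', \$dump or die "Can't open in-memory file: $!";
print "----\n$_" for infoboxes_by_line($in);
close $in;
```

Only the current infobox is held in memory, so the size of the dump file does not matter to the loop itself.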