Considering your other post ( which might or might not have stemmed from this
Yes it did. I've been working on another solution...one using Parse::RecDescent, which I finally finished.
I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit.
Can you briefly describe why?
To avoid this I changed to a single sweep of the ( long ) text chunk as follows
And of course, you can count your own brackets--I figured that is what Text::Balanced does.
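For illustration, counting your own brackets could look something like the sketch below. It is not how Text::Balanced is implemented (I have not checked that), and the helper name `extract_infoboxes` is made up for this example. It assumes that every `{` and `}` in the text is a markup brace and that the braces inside an infobox are balanced.

```perl
use strict;
use warnings;

# Hypothetical helper (not from any module): return every balanced
# {{Infobox ...}} chunk found in $text by counting single braces.
sub extract_infoboxes {
    my ($text) = @_;
    my @chunks;
    my $from = 0;

    while ( ( my $start = index( $text, '{{Infobox', $from ) ) >= 0 ) {
        my $depth = 0;
        my $end;

        # Walk brace by brace from the start of this infobox.
        pos($text) = $start;
        while ( $text =~ /([{}])/g ) {
            $depth += $1 eq '{' ? 1 : -1;
            if ( $depth == 0 ) {
                $end = pos($text);    # offset just past the closing brace
                last;
            }
        }

        last unless defined $end;     # unbalanced braces: stop looking

        push @chunks, substr( $text, $start, $end - $start );
        $from = $end;                 # continue after this infobox
    }
    return @chunks;
}

my $sample = "no {{Infobox a {{{b {{c}} }}} }} no {{Infobox 111}} no";
print "$_\n" for extract_infoboxes($sample);
```

Because each chunk is found in one forward pass and the search resumes where the previous infobox ended, the text is only swept once.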
I'll offer up my Parse::RecDescent solution for the experts to comment on. After I got my grammar to match the text, I wanted to preserve the formatting of the original text, so I used an approach that records the positions in the text where the start and end of each infobox were found. Then I used substr() on the original text.
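The offset arithmetic on its own, without the parser, can be sketched like this. The helper name `slices_from_offsets` is hypothetical, and the offsets here come from index() rather than from the grammar. The only assumption carried over from the program is that the "to" offset points at the last character of the match, which is why the length is `to - from + 1`.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a flat list of (from, to) pairs of inclusive
# character offsets into the corresponding substrings of $text.
sub slices_from_offsets {
    my ( $text, @offsets ) = @_;
    my @slices;
    while ( my ( $from, $to ) = splice( @offsets, 0, 2 ) ) {
        # Both offsets are inclusive, hence the + 1.
        push @slices, substr( $text, $from, $to - $from + 1 );
    }
    return @slices;
}

my $text = "xx {{Infobox a}} yy";
my $from = index( $text, '{{Infobox' );          # offset of the first '{'
my $to   = index( $text, '}}', $from ) + 1;      # offset of the last '}'
print "$_\n" for slices_from_offsets( $text, $from, $to );
```

Since substr() reads from the original string, whatever whitespace and line breaks were in the infobox come back untouched.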
I also thought it might help someone to see all the shenanigans I had to go through to check if my grammar matched. The grammar with all the actions I employed follows the finished program.
use strict;
use warnings;
use 5.012;
use Parse::RecDescent;
$::RD_ERRORS = 1; #Parser dies when it encounters an error
$::RD_WARN = 1; #Enable warnings - warn on unused rules &c.
$::RD_HINT = 1; #Give out hints to help fix problems.
#$::RD_TRACE = 1; #Trace parsers' behaviour
my $text = <<'END_OF_TEXT';
{{Infobox
aaa bbb ccc
{{ddd eee fff
ggg {{ hhh iii}}
jjj}}
{{{kkk {{lll}}
mmm }}}
}}
no no no
no no no
no no no
{{Infobox
aaa2 bbb2 ccc2
{{ddd2 eee2 fff2
ggg2 {{hhh2 iii2}}
jjj2}}
{{{kkk2 {{lll2 }}
mmm2 }}}
}}
{{Infobox 111}}
END_OF_TEXT
#Declare a global variable that can be loaded with
#data from inside the parser:
our @infobox_offsets;
my $grammar = <<'END_OF_GRAMMAR';
{
    use 5.012;          #enable say()
    use Data::Dumper;
}

startrule: paragraph(s)

paragraph: infobox
         | word(s)

infobox: '{{Infobox'
         inner_block(s)
         '}}'
         {
             push @main::infobox_offsets,
                 $itempos[1]->{offset}{from},
                 $itempos[3]->{offset}{to},
             ;
         }

inner_block: brace_block
           | word(s)

#Declare some my variables ('rulevars') for this rule:
brace_block: <rulevar: ($lbraces, $rbraces)>
brace_block: lbrace(2..)
             {
                 $lbraces = join '', @{$item[1]};
                 $rbraces = "}" x length $lbraces;
             }
             inner_block(s)
             "$rbraces"

word: m{ [^{}]+ }xms
lbrace: / [{] /xms
END_OF_GRAMMAR
my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar!\n";

defined $parser->startrule($text)
    or die "Can't match text";

#Using the recorded offsets for the infoboxes
#print out the infobox substr()'s:
my ( $start_infobox,
     $end_infobox,
     $length_infobox
);

while (@infobox_offsets) {
    $start_infobox  = shift @infobox_offsets;
    $end_infobox    = shift @infobox_offsets;
    $length_infobox = 1 + $end_infobox - $start_infobox;

    say '*' x 20;
    say substr $text,
               $start_infobox,
               $length_infobox,
    ;
    say '*' x 20;
}
--output:--
********************
{{Infobox
aaa bbb ccc
{{ddd eee fff
ggg {{ hhh iii}}
jjj}}
{{{kkk {{lll}}
mmm }}}
}}
********************
********************
{{Infobox
aaa2 bbb2 ccc2
{{ddd2 eee2 fff2
ggg2 {{hhh2 iii2}}
jjj2}}
{{{kkk2 {{lll2 }}
mmm2 }}}
}}
********************
********************
{{Infobox 111}}
********************
Here is my grammar with the additional actions that I used to test that the grammar matched the text:
my $grammar = <<'END_OF_GRAMMAR';
{
    use 5.012;          #enable say()
    use Data::Dumper;
}

startrule: paragraph(s)

paragraph: infobox
           { say Dumper(\@item); }
         | word(s)

infobox: '{{Infobox'
         inner_block(s)
         '}}'
         {
             my $inner_blocks = join "", @{$item[2]};
             $return = join "\n", $item[1],
                                  $inner_blocks,
                                  $item[4];
         }

inner_block: brace_block
           | word(s)
             { $return = join "\n", @{$item[1]} ; }

brace_block: <rulevar: ($lbraces, $rbraces)>
brace_block: lbrace(2..)
             {
                 $lbraces = join '', @{$item[1]};
                 $rbraces = "}" x length $lbraces;
             }
             inner_block(s)
             "$rbraces"
             {
                 #say Dumper(\@item);
                 my $inner_blocks = join "", @{$item[3]};
                 $return = "$lbraces $inner_blocks $rbraces";
             }

word: m{ [^{}]+ }xms
lbrace: / [{] /xms
END_OF_GRAMMAR
--output:--
$VAR1 = [
'paragraph',
'{{Infobox
aaa bbb ccc
{{ ddd eee fff
ggg {{ hhh iii }}jjj }}{{{ kkk {{ lll }}mmm }}}
'
];
$VAR1 = [
'paragraph',
'{{Infobox
aaa2 bbb2 ccc2
{{ ddd2 eee2 fff2
ggg2 {{ hhh2 iii2 }}jjj2 }}{{{ kkk2 {{ lll2 }}mmm2 }}}
'
];
$VAR1 = [
'paragraph',
'{{Infobox
111
'
];