PerlMonks  

Re: RegEx - Positive Look-ahead

by 7stud (Deacon)
on Feb 08, 2013 at 03:08 UTC ( #1017749=note )


in reply to RegEx - Positive Look-ahead

Considering your other post ( which might or might not have stemmed from this

Yes it did. I've been working on another solution...one using Parse::RecDescent, which I finally finished.

I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit.

Can you briefly describe why?

To avoid this I changed to a single sweep of the ( long ) text chunk as follows

And of course, you can count your own brackets--I figured that is what Text::Balanced does.
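For anyone who wants to see what counting your own brackets might look like, here is a minimal sketch of a single sweep over the text. It is not the code from either post and it makes no claim about how Text::Balanced works internally: the sub name extract_infoboxes and the sample string are made up for illustration, and it assumes the braces inside each infobox are balanced.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return every top-level
# "{{Infobox ...}}" chunk of $text, found in one left-to-right sweep.
sub extract_infoboxes {
    my ($text) = @_;
    my @found;
    my $depth = 0;    # number of braces currently open
    my $start;        # offset where the current infobox began

    # Stop only at the interesting tokens: an infobox opener or a single brace.
    while ( $text =~ / ( \{\{Infobox | [{}] ) /gx ) {
        my $token = $1;

        if ( $token eq '{{Infobox' ) {
            $start = $-[1] if $depth == 0;    # remember top-level openers only
            $depth += 2;
        }
        elsif ( $token eq '{' ) {
            $depth++;
        }
        else {                                # a closing brace
            next if $depth == 0;              # stray closer outside any block
            $depth--;
            if ( $depth == 0 && defined $start ) {
                # $+[1] is the offset just past the brace that closed the block
                push @found, substr( $text, $start, $+[1] - $start );
                undef $start;
            }
        }
    }
    return @found;
}

my $text  = 'x {{Infobox a {{b {{{c}}} }} }} no {{other}} {{Infobox 1}}';
my @boxes = extract_infoboxes($text);
print "$_\n" for @boxes;
```

Because every brace is counted individually, triple braces such as {{{c}}} need no special handling as long as they are closed properly. The top-level {{other}} block is tracked but never collected, since no infobox start was recorded for it.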

I'll offer up my Parse::RecDescent solution for the experts to comment on. After I got my grammar to match the text, I wanted to preserve the formatting of the original text, so I used an approach that records the positions in the text where the start and end of each infobox were found. Then I used substr() on the original text.

I also thought it might help someone to see all the shenanigans I had to go through to check if my grammar matched. The grammar with all the actions I employed follows the finished program.

use strict;
use warnings;
use 5.012;

use Parse::RecDescent;

$::RD_ERRORS = 1;    #Parser dies when it encounters an error
$::RD_WARN   = 1;    #Enable warnings - warn on unused rules &c.
$::RD_HINT   = 1;    #Give out hints to help fix problems.
#$::RD_TRACE = 1;    #Trace parsers' behaviour

my $text = <<'END_OF_TEXT';
{{Infobox aaa bbb ccc {{ddd eee fff ggg {{ hhh iii}} jjj}} {{{kkk {{lll}} mmm }}} }}
no no no no no no no no no
{{Infobox aaa2 bbb2 ccc2 {{ddd2 eee2 fff2 ggg2 {{hhh2 iii2}} jjj2}} {{{kkk2 {{lll2 }} mmm2 }}} }}
{{Infobox 111}}
END_OF_TEXT

#Declare a global variable that can be loaded with
#data from inside the parser:
our @infobox_offsets;

my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            push @main::infobox_offsets,
                $itempos[1]->{offset}{from},
                $itempos[3]->{offset}{to},
            ;
        }

    inner_block: brace_block | word(s)

    #Declare some my variables ('rulevars') for this rule:
    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"

    word: m{ [^{}]+ }xms

    lbrace: / [{] /xms

END_OF_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar!\n";

defined $parser->startrule($text)
    or die "Can't match text";

#Using the recorded offsets for the infoboxes
#print out the infobox substr()'s:
my ( $start_infobox, $end_infobox, $length_infobox );

while (@infobox_offsets) {
    $start_infobox  = shift @infobox_offsets;
    $end_infobox    = shift @infobox_offsets;
    $length_infobox = 1 + $end_infobox - $start_infobox;

    say '*' x 20;
    say substr $text, $start_infobox, $length_infobox;
    say '*' x 20;
}

--output:--
********************
{{Infobox aaa bbb ccc {{ddd eee fff ggg {{ hhh iii}} jjj}} {{{kkk {{lll}} mmm }}} }}
********************
********************
{{Infobox aaa2 bbb2 ccc2 {{ddd2 eee2 fff2 ggg2 {{hhh2 iii2}} jjj2}} {{{kkk2 {{lll2 }} mmm2 }}} }}
********************
********************
{{Infobox 111}}
********************

Here is my grammar with the additional actions that I used to test that the grammar matched the text:

my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;    #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox
        { say Dumper(\@item); }
      | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            my $inner_blocks = join "", @{$item[2]};
            #Only items 1 to 3 exist at this point, so $item[4] is undef and
            #the dumped strings below end without the closing braces:
            $return = join "\n", $item[1], $inner_blocks, $item[4];
        }

    inner_block: brace_block
      | word(s)
        { $return = join "\n", @{$item[1]}; }

    brace_block: <rulevar: ($lbraces, $rbraces)>

    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"
        {
            #say Dumper(\@item);
            my $inner_blocks = join "", @{$item[3]};
            $return = "$lbraces $inner_blocks $rbraces";
        }

    word: m{ [^{}]+ }xms

    lbrace: / [{] /xms

END_OF_GRAMMAR

--output:--
$VAR1 = [ 'paragraph', '{{Infobox aaa bbb ccc {{ ddd eee fff ggg {{ hhh iii }}jjj }}{{{ kkk {{ lll }}mmm }}} ' ];
$VAR1 = [ 'paragraph', '{{Infobox aaa2 bbb2 ccc2 {{ ddd2 eee2 fff2 ggg2 {{ hhh2 iii2 }}jjj2 }}{{{ kkk2 {{ lll2 }}mmm2 }}} ' ];
$VAR1 = [ 'paragraph', '{{Infobox 111 ' ];


Re^2: RegEx - Positive Look-ahead
by tmharish (Friar) on Feb 12, 2013 at 14:10 UTC
    Can you briefly describe why?

    I am parsing Wikipedia dumps; one such file is ( WARNING: 41MB file ) this. I rewrote my entire parser to sift through the file one line at a time, which turns out to be much faster than loading up chunks and running a RegEx over them.
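A line-at-a-time sweep along those lines might look like the sketch below. This is not tmharish's actual parser, which is not shown in the thread: the sub name infoboxes_from_handle is hypothetical, an in-memory filehandle stands in for the dump file, and balanced braces are assumed.

```perl
use strict;
use warnings;

# Hypothetical helper (not the parser described above): read a filehandle
# one line at a time and collect each top-level "{{Infobox ...}}" block.
sub infoboxes_from_handle {
    my ($fh) = @_;
    my @boxes;
    my $depth = 0;    # number of braces currently open
    my $current;      # text of the infobox being collected, if any

    while ( my $line = <$fh> ) {
        # Split the line into pieces: an infobox opener, a single brace,
        # or a run of any other characters (newline included).
        while ( $line =~ / ( \{\{Infobox | [{}] | [^{}]+ ) /gx ) {
            my $piece = $1;

            if ( $piece eq '{{Infobox' ) {
                $current = '' if $depth == 0;    # start collecting
                $depth += 2;
            }
            elsif ( $piece eq '{' ) {
                $depth++;
            }
            elsif ( $piece eq '}' ) {
                next if $depth == 0;             # ignore a stray closer
                $depth--;
            }

            next unless defined $current;        # outside any infobox
            $current .= $piece;
            if ( $depth == 0 ) {                 # the infobox just closed
                push @boxes, $current;
                undef $current;
            }
        }
    }
    return @boxes;
}

# An in-memory filehandle stands in for the real dump file here.
my $data = "junk\n{{Infobox a\n{{b}}\n}}\nmore {{c}}\n{{Infobox 1}}\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @boxes = infoboxes_from_handle($fh);
close $fh;

print "$_\n---\n" for @boxes;
```

Only the line being read and the infobox under construction are held in memory, so the size of the dump matters much less than it does when whole chunks are loaded and matched.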
