http://www.perlmonks.org?node_id=1017155

tmharish has asked for the wisdom of the Perl Monks concerning the following question:

I am using the following to extract data with a {{Infobox ... }} block, with the catch being that there might be {{ ... }} blocks within it:

use strict ;
use warnings ;
use Data::Dump qw( dump ) ;

my @data = <DATA> ;
my $data = join( '', @data ) ;

my @info_box_contents ;
while( $data =~ m/( {{Infobox .*? ({{(?=}}))? .*? }} )/xsg ) {
    print STDERR "MATCHED\n";
    push @info_box_contents, $1 ;
}

dump( \@info_box_contents ) ;

__DATA__
{{Infobox
 text text text
 {{text text text
 text {{text text}} text}}
 {{{text {{text }} text }}}
}}
blah blah blah
blah blah blah
{{Infobox
 text1 text1 text1
 {{text1 text1 text1
 text1 {{text1 text1}} text1}}
 {{{text1 {{text1 }} text1 }}}
}}
{{Infobox one}}

I get the following output:

MATCHED
MATCHED
MATCHED
[
  "{{Infobox\n text text text \n {{text text text \n text {{text text}}",
  "{{Infobox\n text1 text1 text1 \n {{text1 text1 text1 \n text1 {{text1 text1}}",
  "{{Infobox one}}",
]

I am expecting it to match the entire block up to the first mismatched '}}'

What's more, if I remove the '?' from my look-ahead group, changing that part of the pattern from ({{(?=}}))? to ({{(?=}})), I match nothing.

Help will be greatly appreciated.
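A minimal sketch of what the pattern above appears to be doing, using a shortened sample string that is an assumption and not the original data. A look-ahead is zero-width, so `{{(?=}})` can only match a `{{` that is immediately followed by `}}`. That never occurs in the sample text, which would explain why the mandatory version matches nothing, and why the optional version just lets the lazy `.*?` stop at the first `}}`.

```perl
use strict;
use warnings;

# '{{' followed IMMEDIATELY by '}}' (the look-ahead consumes nothing).
my $with_lookahead = qr/\{\{(?=\}\})/;

my $nested = '{{Infobox a {{b}} c}}';   # hypothetical sample
my $empty  = 'x {{}} y';                # only an empty pair satisfies the look-ahead

print "nested: ", ( $nested =~ $with_lookahead ? "match" : "no match" ), "\n";
print "empty:  ", ( $empty  =~ $with_lookahead ? "match" : "no match" ), "\n";

# With the group optional, the lazy quantifiers stop at the first '}}'.
my ($lazy) = $nested =~ /(\{\{Infobox.*?(\{\{(?=\}\}))?.*?\}\})/s;
print "lazy:   $lazy\n";
```

Nothing in this pattern counts nesting depth, so it cannot find the closing `}}` that belongs to the opening `{{Infobox`.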

Re: RegEx - Positive Look-ahead
by 7stud (Deacon) on Feb 05, 2013 at 20:28 UTC
    Is this what you want???
    use strict;
    use warnings;
    use 5.012;
    use Text::Balanced qw( extract_tagged extract_multiple );

    my $text = <<'END_OF_STRING';
    {{Infobox
     text text text
     {{text text text
     text {{text text}} text}}
     {{{text {{text }} text }}}
    }}
    blah blah blah
    blah blah blah
    {{Infobox
     text1 text1 text1
     {{text1 text1 text1
     text1 {{text1 text1}} text1}}
     {{{text1 {{text1 }} text1 }}}
    }}
    {{Infobox one}}
    END_OF_STRING

    my @infoboxes = extract_multiple( $text, [ \&my_extractor ], undef, 1 );

    sub my_extractor {
        extract_tagged( $text, "{{", "}}", );
    }

    for my $infobox (@infoboxes) {
        say $infobox;
        say '*' x 20;
    }

    --output:--
    {{Infobox
     text text text
     {{text text text
     text {{text text}} text}}
     {{{text {{text }} text }}}
    }}
    ********************
    {{Infobox
     text1 text1 text1
     {{text1 text1 text1
     text1 {{text1 text1}} text1}}
     {{{text1 {{text1 }} text1 }}}
    }}
    ********************
    {{Infobox one}}
    ********************

    Here's the same result using regexes via Regexp::Common:

    use strict;
    use warnings;
    use 5.012;
    use Regexp::Common qw( balanced );

    my $text = <<'END_OF_STRING';
    {{Infobox
     text text text
     {{text text text
     text {{text text}} text}}
     {{{text {{text }} text }}}
    }}
    blah blah blah
    blah blah blah
    {{Infobox
     text1 text1 text1
     {{text1 text1 text1
     text1 {{text1 text1}} text1}}
     {{{text1 {{text1 }} text1 }}}
    }}
    {{Infobox one}}
    END_OF_STRING

    my $pattern = $RE{ balanced } { -begin => '{{' } { -end => '}}' };

    while ($text =~ /($pattern)/gxms) {
        say $1;
        say '*' x 20;
    }

    --output:--
    {{Infobox
     text text text
     {{text text text
     text {{text text}} text}}
     {{{text {{text }} text }}}
    }}
    ********************
    {{Infobox
     text1 text1 text1
     {{text1 text1 text1
     text1 {{text1 text1}} text1}}
     {{{text1 {{text1 }} text1 }}}
    }}
    ********************
    {{Infobox one}}
    ********************

      Works like a charm - Thank you very much.

      Also this was really helpful and this has been noted.

      7stud

      Considering your other post ( which might or might not have stemmed from this ) I thought I would update this thread with the final solution that I used ( also for anyone else who might care ).

      I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit. To avoid this I changed to a single sweep of the ( long ) text chunk as follows - I have removed the other parts that I extracted in the same sweep so as to stick to the OP topic.

      use strict ;
      use warnings ;
      use Data::Dump qw( dump ) ;

      my $text = <<'END_OF_STRING';
      {{Infobox
       text text text
       {{text text text
       text {{text text}} text}}
       {{{text {{text }} text }}}
      END}}
      blah blah blah
      blah blah blah
      {{Infobox
       text1 text1 text1
       {{text1 text1 text1
       text1 {{text1 text1}} text1}}
       {{{text1 {{text1 }} text1 }}}
      }}
      {{Infobox one}}
      END_OF_STRING

      my $box_contents = _get_info_boxes( $text ) ;
      dump( $box_contents ) ;
      exit;

      sub _get_info_boxes {
          my $text = shift ;

          my @info_box_contents ;
          my $in_info_box ;
          my $this_info_box_content = "" ;
          my $bracket_count = 0 ;

          foreach my $line ( split( /\n/, $text ) ) {
              unless( $in_info_box ) {
                  next unless( $line =~ /{{Infobox/ ) ;
                  $in_info_box = 1 ;
              }
              $this_info_box_content .= $line . "\n" ;

              my $open_count  = ( $line =~ tr/{// ) ;
              my $close_count = ( $line =~ tr/}// ) ;
              $bracket_count = $bracket_count + $open_count - $close_count ;

              if( $bracket_count == 0 ) {
                  push @info_box_contents, $this_info_box_content ;
                  $this_info_box_content = "" ;
                  $in_info_box = 0 ;
                  $bracket_count = 0 ;
              }
          }
          return \@info_box_contents ;
      }
Re: RegEx - Positive Look-ahead
by sundialsvc4 (Abbot) on Feb 05, 2013 at 13:39 UTC

    To me, this is the sort of problem that should be described in terms of a grammar, and then handled using Parse::RecDescent.   (I have used that module quite extensively and it works very well.)

    Let me give you one small tip if you go that way... Part of your code will consist of the grammar and the Perl code that will be compiled on the fly into it. This is a great place to put any use statement(s) that pull in the subroutines you find you need in the grammar code.

    Anyway, the advantage of this approach is that you describe the structure of the language being processed and let the parser do the magic.   It takes a bit of practice to get the grammar right, heh, but the parser can handle the mechanics of regex’ing and backtracking so that your code doesn’t have to.   You can extend it to do more things without getting buried.

      Thanks sundialsvc4

      Am going through the documentation of that module.

      Will reply once I have completed that.

        I just had my first run-in with Parse::RecDescent, and after I figured out the basics from the docs, I posted some beginner tips here (link).

Re: RegEx - Positive Look-ahead
by 7stud (Deacon) on Feb 05, 2013 at 18:42 UTC
    This regex:

    /[aaaaa]/

    is equivalent to:

    /[a]/

    So this regex:

    [^{{}}]

    Is equivalent to:

    [^{}]
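    A small sketch to illustrate the point, with a made-up sample string. A character class is a set of single characters, so repeating a brace inside it changes nothing. Excluding the two-character sequences `{{` and `}}` needs something else, for example a negative look-ahead in front of each character, which is shown here as one possible alternative and not as the fix used in this thread.

```perl
use strict;
use warnings;

my $sample = 'ab{cd}}ef{{gh';    # hypothetical sample

# Both classes mean "any character except { or }".
my @doubled = $sample =~ /([^{{}}]+)/g;
my @single  = $sample =~ /([^{}]+)/g;

print "doubled: @doubled\n";
print "single:  @single\n";
print "identical results\n" if "@doubled" eq "@single";

# Stop only at a DOUBLE brace: consume one character at a time,
# provided '{{' or '}}' does not start at that position.
my ($run) = $sample =~ /^((?:(?!\{\{|\}\}).)+)/s;
print "run up to the first double brace: $run\n";
```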

    I am expecting it to match the entire block up to the first mismatched '}}'

    Huh?

    First, you said this:

    I am using the following to extract data with a {{Infobox ... }} block, with the catch being that there might be {{ ... }} blocks within it.

    Then you said this:

    The problem is that this does not match if I have {{ something {{{ text }}} }} in my content ...

    Here is a regex that will 'match' both:

    /.*/

    To ask a regex question that is specific enough to get a relevant answer, you need to post:

    1. An example of your text.
    2. The exact text you want to end up with.

    Note that for 2), you DO NOT post a description of the text that you want to end up with. Why? Because descriptions are usually gibberish, and at best they are subject to different interpretations. If required, repeat steps 1) and 2) as many times as needed to highlight the twists and turns in the text you need to match.

Re: RegEx - Positive Look-ahead
by Anonymous Monk on Feb 05, 2013 at 13:54 UTC

      Thank You!

      I modified my RegEx like so:

      (                   # start of capture group 1
          {{              # match an opening
          (?:
              [^{{}}]++   # one or more, non backtracking
            |
              (?1)        # found {{ or }}, so recurse to capture group 1
          )*
          }}              # match a closing
      )                   # end of capture group 1

      The problem is that this does not match if I have {{ something {{{ text }}} }} in my content ...

        if you want to match {{{ and {{ maybe you can make it
        [{]{2,3} # opener
        [}]{2,3} # closer
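One way to cope with mixed `{{ ... }}` and `{{{ ... }}}` runs, sketched here as an assumption and not as the solution anyone in the thread settled on, is to balance single braces and let recursion handle the depth. The group name `block` and the sample text are made up for the illustration, and the recursive syntax needs Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;

my $text = <<'END_OF_TEXT';
{{Infobox
 a {{b {{c}} d}}
 {{{e {{f}} g}}}
}}
junk {{Other x}}
{{Infobox one}}
END_OF_TEXT

# One balanced {...} block; nesting is handled by recursing into the group.
my $balanced = qr/
    (?<block>
        \{                    # one opening brace
        (?: [^{}]++           # a run of non-brace characters, no backtracking
          | (?&block)         # or a nested balanced block
        )*+
        \}                    # the matching closing brace
    )
/x;

my @boxes;
# Only start where '{{Infobox' begins, then take the whole balanced block.
while ( $text =~ / ( (?= \{\{Infobox ) $balanced ) /gx ) {
    push @boxes, $1;
}

print scalar(@boxes), " infoboxes found\n";
print "$_\n----\n" for @boxes;
```

Because this counts individual braces, it assumes the braces in the text are balanced overall; a block with a stray `{` or `}` is simply skipped.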
Re: RegEx - Positive Look-ahead
by 7stud (Deacon) on Feb 08, 2013 at 03:08 UTC

    Considering your other post ( which might or might not have stemmed from this

    Yes it did. I've been working on another solution...one using Parse::RecDescent, which I finally finished.

    I found that, considering {{Infobox was not the only chunk I needed, I was taking a huge performance hit.

    Can you briefly describe why?

    To avoid this I changed to a single sweep of the ( long ) text chunk as follows

    And of course, you can count your own brackets--I figured that is what Text::Balanced does.

    I'll offer up my Parse::RecDescent solution for the experts to comment on. After I got my grammar to match the text, I wanted to preserve the formatting of the original text, so I used an approach that records the positions in the text where the start and end of an infobox was found. Then I used substr() on the original text.
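    The record-the-offsets-then-substr() idea can also be sketched without a parser module. The version below is a swapped-in technique, not the Parse::RecDescent approach described here: it scans brace tokens with a plain regex, tracks depth, and reads offsets from Perl's built-in `@-` and `@+` arrays. The sample text and variable names are assumptions.

```perl
use strict;
use warnings;

my $text = "intro {{Infobox a {{b}}\n c}} middle {{Infobox one}} end";

my @offsets;     # [ start, end ] pairs; end is exclusive
my $depth = 0;
my $start;       # defined only while inside an infobox

while ( $text =~ /(\{\{Infobox|[{}])/g ) {
    my $token = $1;

    if ( !defined $start ) {
        next unless $token eq '{{Infobox';
        $start = $-[0];      # offset where this match began
        $depth = 2;          # the two braces of '{{Infobox'
        next;
    }

    $depth += ( $token eq '}' ) ? -1
            : ( $token eq '{' ) ?  1
            :                      2;    # a nested '{{Infobox'

    if ( $depth == 0 ) {
        push @offsets, [ $start, $+[0] ];    # $+[0] is one past the last '}'
        undef $start;
    }
}

# Cut the infoboxes out of the untouched original text.
for my $pair (@offsets) {
    my ( $from, $to ) = @$pair;
    print substr( $text, $from, $to - $from ), "\n";
    print '*' x 20, "\n";
}
```

Since `$+[0]` is an exclusive end offset, the length is simply `$to - $from`, with no `1 +` adjustment.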

    I also thought it might help someone to see all the shenanigans I had to go through to check if my grammar matched. The grammar with all the actions I employed follows the finished program.

    use strict;
    use warnings;
    use 5.012;
    use Parse::RecDescent;

    $::RD_ERRORS = 1;   #Parser dies when it encounters an error
    $::RD_WARN   = 1;   #Enable warnings - warn on unused rules &c.
    $::RD_HINT   = 1;   #Give out hints to help fix problems.
    #$::RD_TRACE = 1;   #Trace parsers' behaviour

    my $text = <<'END_OF_TEXT';
    {{Infobox
     aaa bbb ccc
     {{ddd eee fff
     ggg {{ hhh iii}} jjj}}
     {{{kkk {{lll}} mmm }}}
    }}
    no no no
    no no no
    no no no
    {{Infobox
     aaa2 bbb2 ccc2
     {{ddd2 eee2 fff2
     ggg2 {{hhh2 iii2}} jjj2}}
     {{{kkk2 {{lll2 }} mmm2 }}}
    }}
    {{Infobox 111}}
    END_OF_TEXT

    #Declare a global variable that can be loaded with
    #data from inside the parser:
    our @infobox_offsets;

    my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;  #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            push @main::infobox_offsets,
                $itempos[1]->{offset}{from},
                $itempos[3]->{offset}{to},
            ;
        }

    inner_block: brace_block | word(s)

    #Declare some my variables ('rulevars') for this rule:
    brace_block: <rulevar: ($lbraces, $rbraces)>
    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"

    word: m{ [^{}]+ }xms
    lbrace: / [{] /xms

    END_OF_GRAMMAR

    my $parser = Parse::RecDescent->new($grammar)
        or die "Bad grammar!\n";

    defined $parser->startrule($text)
        or die "Can't match text";

    #Using the recorded offsets for the infoboxes
    #print out the infobox substr()'s:
    my ( $start_infobox, $end_infobox, $length_infobox );

    while (@infobox_offsets) {
        $start_infobox  = shift @infobox_offsets;
        $end_infobox    = shift @infobox_offsets;
        $length_infobox = 1 + $end_infobox - $start_infobox;

        say '*' x 20;
        say substr $text, $start_infobox, $length_infobox, ;
        say '*' x 20;
    }

    --output:--
    ********************
    {{Infobox
     aaa bbb ccc
     {{ddd eee fff
     ggg {{ hhh iii}} jjj}}
     {{{kkk {{lll}} mmm }}}
    }}
    ********************
    ********************
    {{Infobox
     aaa2 bbb2 ccc2
     {{ddd2 eee2 fff2
     ggg2 {{hhh2 iii2}} jjj2}}
     {{{kkk2 {{lll2 }} mmm2 }}}
    }}
    ********************
    ********************
    {{Infobox 111}}
    ********************

    Here is my grammar with the additional actions that I used to test that the grammar matched the text:

    my $grammar = <<'END_OF_GRAMMAR';

    {
        use 5.012;  #enable say()
        use Data::Dumper;
    }

    startrule: paragraph(s)

    paragraph: infobox
        {
            say Dumper(\@item);
        }
        | word(s)

    infobox: '{{Infobox' inner_block(s) '}}'
        {
            my $inner_blocks = join "", @{$item[2]};
            $return = join "\n", $item[1], $inner_blocks, $item[4];
        }

    inner_block: brace_block
        | word(s)
        {
            $return = join "\n", @{$item[1]} ;
        }

    brace_block: <rulevar: ($lbraces, $rbraces)>
    brace_block: lbrace(2..)
        {
            $lbraces = join '', @{$item[1]};
            $rbraces = "}" x length $lbraces;
        }
        inner_block(s) "$rbraces"
        {
            #say Dumper(\@item);
            my $inner_blocks = join "", @{$item[3]};
            $return = "$lbraces $inner_blocks $rbraces";
        }

    word: m{ [^{}]+ }xms
    lbrace: / [{] /xms

    END_OF_GRAMMAR

    --output:--
    $VAR1 = [
        'paragraph',
        '{{Infobox aaa bbb ccc {{ ddd eee fff ggg {{ hhh iii }}jjj }}{{{ kkk {{ lll }}mmm }}} '
    ];
    $VAR1 = [
        'paragraph',
        '{{Infobox aaa2 bbb2 ccc2 {{ ddd2 eee2 fff2 ggg2 {{ hhh2 iii2 }}jjj2 }}{{{ kkk2 {{ lll2 }}mmm2 }}} '
    ];
    $VAR1 = [
        'paragraph',
        '{{Infobox 111 '
    ];

      Can you briefly describe why?

      I am parsing Wikipedia dumps; one such file is ( WARNING: 41MB file ) this. I rewrote my entire parser to sift through the file one line at a time, which turns out to be much faster than loading up chunks and running regexes on them.
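      A minimal sketch of the line-at-a-time idea, reading from a filehandle instead of splitting a string held in memory. The in-memory handle and sample text stand in for the real dump file and are assumptions; for an actual dump, the open call would take a file path.

```perl
use strict;
use warnings;

my $sample = <<'END_OF_SAMPLE';
intro line
{{Infobox
 a {{b}}
}}
between
{{Infobox one}}
END_OF_SAMPLE

# In-memory handle for the demo; replace \$sample with a path for a real file.
open my $fh, '<', \$sample or die "Cannot open sample: $!";

my @boxes;
my $current;      # defined only while inside an infobox
my $depth = 0;

while ( my $line = <$fh> ) {
    if ( !defined $current ) {
        next unless $line =~ /\{\{Infobox/;
        $current = '';
    }
    $current .= $line;

    # tr with an empty replacement list only counts characters.
    $depth += ( $line =~ tr/{// ) - ( $line =~ tr/}// );

    if ( $depth <= 0 ) {
        push @boxes, $current;
        undef $current;
        $depth = 0;
    }
}
close $fh;

print scalar(@boxes), " infoboxes found\n";
print $_, '-' x 20, "\n" for @boxes;
```

Only one line is held at a time besides the infobox being collected, so memory use stays small on large files. The trade-off is whole-line granularity: anything else on the same line as the opening or closing braces ends up in the captured block.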