PerlMonks  

Text::Balanced with nested / custom brackets

by Anonymous Monk
on Sep 07, 2006 at 20:35 UTC (#571798)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I've been struggling with this for a couple of days, though I don't think the problem is very difficult. Within:

    'this is a [[link with a [nested]]]'

I would like to extract the entire parent [..]. There will be multiple instances, not all with nested brackets. What is the correct syntax? I've tried all kinds of things without much luck. I'm assuming it's done with extract_multiple and an extractor ref, but I can't figure out exactly what to do:

    my @data = extract_multiple( $text, ???? );

FYI, this is all toward parsing raw wiki text. Thanks in advance.

Re: Text::Balanced with nested / custom brackets
by Skeeve (Vicar) on Sep 07, 2006 at 21:03 UTC
    Did you try searching CPAN for a module? Maybe there is something for parsing wiki text.

      Yes, I did. All of the Wiki modules I found on CPAN are terrible. There doesn't seem to be any one flexible/powerful parser.
Re: Text::Balanced with nested / custom brackets
by ikegami (Pope) on Sep 07, 2006 at 21:48 UTC
    As far as I can tell, Text::Balanced deals with single-character delimiters, whereas your delimiter has two. You might have to resort to using a regexp.
    my $extractor;  # Must be a separate statement.
    $extractor = qr/
       \[\[
       (?:
          (?: (?! \[\[ | \]\] ) . )+
       |
          (??{ $extractor })
       )+
       \]\]
    /x;

    my @links = $text =~ /$extractor/g;

    Optimized (I think):

    my $extractor;  # Must be a separate statement.
    $extractor = qr/
       \[\[
       (?>
          (?:
             (?: (?> [^\[\]]+ ) | \[ (?! \[ ) | \] (?! \] ) )
          |
             (??{ $extractor })
          )+
       )
       \]\]
    /x;

    my @links = $text =~ /$extractor/g;

    Tested.
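For anyone who wants to try the optimized pattern above, here is a minimal, self-contained sketch. The sample strings are made up for illustration. Note that the pattern treats only [[ and ]] as delimiters, so single brackets count as ordinary text.

```perl
use strict;
use warnings;

# Hypothetical sample text, chosen to show nested and non-nested links.
my $text = 'see [[outer [[inner]] link]] and [[plain]] here';

my $extractor;    # declared first so the pattern can refer to itself
$extractor = qr/
    \[\[
    (?>
        (?:
            # plain text, or a single bracket that is not part of [[ or ]]
            (?: (?> [^\[\]]+ ) | \[ (?! \[ ) | \] (?! \] ) )
        |
            # a nested [[...]] link, matched recursively
            (??{ $extractor })
        )+
    )
    \]\]
/x;

# List context with /g and no capture groups returns every whole match.
my @links = $text =~ /$extractor/g;
print "$_\n" for @links;

# The sample from the question, which ends in three closing brackets.
my @from_question = 'this is a [[link with a [nested]]]' =~ /$extractor/g;
print "$_\n" for @from_question;
```

With the sample from the question, the pattern closes the link at the first ]] it reaches, so it returns '[[link with a [nested]]' and leaves the final ] behind. If the trailing bracket must be included, the single bracket has to be treated as the delimiter instead.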

      Thank you for those regexes! I'll play around with them to see if I can get myself moving.

      In the long run, I'd still like to know if Text::Balanced can be massaged into dealing with this situation. It does deal with <tags> and such.

        The function to extract tagged data can indeed be used.

        my @links;
        # Text::Balanced compiles the tag arguments as patterns, so the brackets are escaped.
        my $extractor = gen_extract_tagged('\[\[', '\]\]', qr/(?:(?!\[\[).)*/);
        for (;;) {
            (my $link, $text) = $extractor->($text);
            last if not defined $link;
            push(@links, $link);
        }

        Untested.
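A different Text::Balanced function may fit if the single bracket is the real delimiter, which is one reading of "extract the entire parent [..]". The sketch below uses extract_bracketed rather than the gen_extract_tagged approach above. The sample text, the loop, and the variable names are assumptions for illustration, not something from the thread.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_bracketed);

# Hypothetical sample: one nested group and one plain group.
my $text = 'this is a [[link with a [nested]]] and [plain] text';

my @found;
my $rest = $text;
while (1) {
    # Arguments: text, delimiter set, prefix pattern to skip before the
    # opening bracket. In list context the call returns the extracted
    # substring first and the remainder second.
    my ($extracted, $remainder) = extract_bracketed($rest, '[]', qr/[^\[]*/);
    last unless defined $extracted;    # no further bracketed group
    push @found, $extracted;
    $rest = $remainder;
}

print "$_\n" for @found;
```

Here every [ opens a nesting level and every ] closes one, so the outermost group comes back whole, including all three closing brackets.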

      It's worth noting that ikegami's code is essentially a derivation of code in perlre for matching balanced parens:
      $re = qr{
                \(
                (?:
                   (?> [^()]+ )    # Non-parens without backtracking
                 |
                   (??{ $re })     # Group with matching parens
                )*
                \)
             }x;
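To see the perlre pattern in action, here is a small sketch run against a made-up string. It also shows one possible equivalent using the named recursion syntax available since Perl 5.10; the group name "group" and the sample text are assumptions for illustration.

```perl
use strict;
use warnings;
use 5.010;    # named groups and (?&name) recursion need Perl 5.10 or later

# Hypothetical sample with a nested group, an empty group, and an unclosed one.
my $text = 'f(a, (b + c)) * g() + (unclosed';

# The perlre pattern: the qr refers to itself through (??{ $re }).
my $re;
$re = qr{
    \(
    (?:
        (?> [^()]+ )    # Non-parens without backtracking
      |
        (??{ $re })     # Group with matching parens
    )*
    \)
}x;
my @groups = $text =~ /$re/g;

# A sketch of the same idea without a self-referencing variable.
my $modern = qr{
    (?<group>
        \(
        (?: [^()]++ | (?&group) )*    # possessive run, or recurse into the group
        \)
    )
}x;
my @same = $text =~ /$modern/g;    # one capture group, so /g returns its contents

print "$_\n" for @groups;
```

Both patterns skip the unclosed parenthesis at the end, because a match requires a closing counterpart for every opening one.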
