Regex priorities.

by Steve_BZ (Chaplain)
on May 28, 2012 at 00:54 UTC ( #972763=perlquestion )
Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I have a regex designed to pick out some XML tags.

A typical string might look like this:


I want to process the smallest, innermost pair of tags first (i.e., in this case, <EXHIBITING></EXHIBITING>).

I am using the regex:

my $r = '\<(\D+)\>(.*?)(\<\/\1\>)';
while ($loc_diagnoses_text =~ m/$r/gi){
    ... processing stuff ....
}

But it is processing the <COMMA></COMMA> pair first. How do I fix this?
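The behaviour described above can be reproduced with a short script. This is only a sketch: the sample string is hypothetical (the original example was not preserved) and has the same shape as the one used in the first reply. The regex engine returns the leftmost match it can find, and a match starting at the outer opening tag succeeds because `(.*?)` is allowed to expand across the nested tags until it reaches the matching closing tag.

```perl
use strict;
use warnings;

# Hypothetical sample string; the example from the original post is missing.
my $text = '<COMMA>pre-stuff<EXHIBITING>some stuff</EXHIBITING>post-stuff</COMMA>';

# The pattern from the question, unchanged.
my $r = '\<(\D+)\>(.*?)(\<\/\1\>)';

my @seen;
while ($text =~ m/$r/gi) {
    push @seen, $1;    # record tag names in the order they are matched
}

# Only the outer pair is reported; the inner pair is swallowed by (.*?).
print "matched: @seen\n";
```

Because the first match consumes the whole string, the loop never gets a chance to look at the inner pair on its own.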



Re: Regex priorities.
by BrowserUk (Pope) on May 28, 2012 at 01:20 UTC

    Does this match your expectations?

    $s = '<COMMA>pre-stuff<EXHIBITING>some stuff</EXHIBITING>post-stuff</COMMA>';;
    print "$1 :: $2" while $s =~ s[<(\D+)>([^<]*?)</\1>][]gi;;
    EXHIBITING :: some stuff
    COMMA :: pre-stuffpost-stuff

    Of course, it fails horribly if your non-tag content contains '<':

    $s = '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>';;
    print "$1 :: $2" while $s =~ s[<(\D+)>([^<]*?)</\1>][]gi;;
    {zilch here}
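One possible way around the stray '<' problem, offered as a sketch and not as BrowserUk's method: instead of forbidding every '<' in the content, forbid only something that looks like an opening tag. The tag-name syntax in `$name` (a letter followed by word characters) is an assumption, not something stated in the thread.

```perl
use strict;
use warnings;

my $name = qr/[A-Za-z]\w*/;    # assumed tag-name syntax
my $s = '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>';

my @order;
# Content may contain any character, including '<', as long as no opening
# tag starts at that position. Each pass removes one innermost element.
while ($s =~ s{<($name)>((?:(?!<$name>).)*?)</\1>}{}is) {
    push @order, "$1 :: $2";
}
print "$_\n" for @order;
```

The negative lookahead makes the match fail at an outer tag while an inner element is still present, so the substitution falls through to the inner element first. Content that contains text resembling a tag would still confuse it.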

      Hi BrowserUk,

      Thanks for that. It worked perfectly. I decided to exclude "/" instead of "<" in the middle part, because I do use "<" in my text for other reasons but never "/", so I ended up with:

      my $r = q(
          <(\D+)>     # Opening tag <....>
          ([^/]*?)    # stuff in the middle which does not have the closing tag character '/'
          (<\/\1>)    # closing tag of same type as opening tag </....>.
      );
      while ($text =~ m/$r/gix){
          ... processing ...
      }

      Thanks for your help.



Re: Regex priorities.
by Anonymous Monk on May 28, 2012 at 01:28 UTC

      Hi Anon,

      Thanks for this. I did in fact read most of the links you so kindly posted.

      I also thought it was a bit like a compiler problem and parsing was a potential solution, but I thought it would take longer. I was quite interested in how you would have parsed it.
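For readers wondering what a parsing approach might look like, here is a minimal stack-based sketch. It is not Anonymous Monk's answer, whose content is not preserved here. The function name `innermost_first` and the tag-name syntax are hypothetical, and the sample string is the one from the first reply.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [tag, content] pairs, innermost elements first.
sub innermost_first {
    my ($text) = @_;
    my (@stack, @done);
    my $buf = '';    # text collected for the element currently open

    # Split into tags and the text between them; the capture keeps the tags.
    for my $tok (split m{(</?[A-Za-z]\w*>)}, $text) {
        if ($tok =~ m{^<([A-Za-z]\w*)>$}) {        # opening tag
            push @stack, [$1, $buf];               # save name and parent's text so far
            $buf = '';
        }
        elsif ($tok =~ m{^</([A-Za-z]\w*)>$}) {    # closing tag
            my $open = pop @stack
                or die "unexpected </$1>\n";
            my ($tag, $parent_buf) = @$open;
            die "mismatched </$1>, expected </$tag>\n" if lc($1) ne lc($tag);
            push @done, [$tag, $buf];              # an element is complete when it closes
            $buf = $parent_buf;                    # resume the parent's text
        }
        else {
            $buf .= $tok;                          # plain text, may contain '<' or '/'
        }
    }
    die "unclosed <$stack[-1][0]>\n" if @stack;
    return @done;
}

my @pairs = innermost_first(
    '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>');
print "$_->[0] :: $_->[1]\n" for @pairs;
```

An element is recorded at the moment its closing tag is seen, so nested elements are always reported before the elements that contain them, and mismatched or unclosed tags are reported as errors instead of silently producing no match.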



