http://www.perlmonks.org?node_id=972763

Steve_BZ has asked for the wisdom of the Perl Monks concerning the following question:

Hi Guys,

I have a regex designed to pick out some XML tags.

A typical string might look like this:

<COMMA><EXHIBITING></EXHIBITING></COMMA>

I want to process the smallest, innermost pair of tags first (i.e., in this case <EXHIBITING></EXHIBITING>).

I am using the regex:

my $r = '\<(\D+)\>(.*?)(\<\/\1\>)';
while ($loc_diagnoses_text =~ m/$r/gi){
    ... processing stuff ....
}

But it is processing the <COMMA></COMMA> pair first. How do I fix this?

Regards

Steve

Re: Regex priorities.
by BrowserUk (Patriarch) on May 28, 2012 at 01:20 UTC

    Does this match your expectations?

    $s = '<COMMA>pre-stuff<EXHIBITING>some stuff</EXHIBITING>post-stuff</COMMA>';;
    print "$1 :: $2" while $s =~ s[<(\D+)>([^<]*?)</\1>][]gi;;
    EXHIBITING :: some stuff
    COMMA :: pre-stuffpost-stuff

    Of course, it fails horribly if your non-tag content contains '<':

    $s = '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>';;
    print "$1 :: $2" while $s =~ s[<(\D+)>([^<]*?)</\1>][]gi;;
    {zilch here}
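The two runs above can be reproduced with a small self-contained sketch. The helper name `strip_innermost` and its return shape are assumptions made for illustration, not part of the original reply. It also drops the /g flag and lets the loop rescan from the start after each removal, which gives the same result for these inputs:

```perl
use strict;
use warnings;

# Hypothetical helper: repeatedly remove the first tag pair whose content
# contains no '<', i.e. a pair with no other tag nested inside it.
# Returns a reference to the list of [tag, content] pairs in removal order,
# followed by whatever is left of the string.
sub strip_innermost {
    my ($s) = @_;
    my @pairs;
    # Each pass removes one pair; an outer pair only becomes matchable
    # once everything nested inside it has been removed.
    while ( $s =~ s{<(\D+)>([^<]*?)</\1>}{}i ) {
        push @pairs, [ $1, $2 ];
    }
    return ( \@pairs, $s );
}

my ( $pairs, $rest ) = strip_innermost(
    '<COMMA>pre-stuff<EXHIBITING>some stuff</EXHIBITING>post-stuff</COMMA>');
print "$_->[0] :: $_->[1]\n" for @$pairs;

# A '<' inside the text blocks every match, so nothing is removed and the
# whole input comes back as the leftover.
my ( $none, $left ) = strip_innermost(
    '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>');
print scalar(@$none), " pairs found, leftover: $left\n";
```

Returning the leftover string makes the failure case easy to detect: if it still contains tags after the loop, some pair could not be matched.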

      Hi BrowserUk,

      Thanks for that. It worked perfectly. I decided to exclude "/" rather than "<" from the middle part, because I do use "<" in my text for other reasons but never "/", so I ended up with:

      my $r = q(
          <(\D+)>      # Opening tag <....>
          ([^/]*?)     # stuff in the middle which does not have the closing tag character '/'
          (<\/\1>)     # closing tag of same type as opening tag </....>.
      );
      while ($text =~ m/$r/gix){
          ... processing ...
      }

      Thanks for your help.

      Regards

      Steve
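Steve's final pattern can be exercised with a short runnable sketch. The helper name `find_pairs` and the shortened comments are assumptions for illustration; the pattern itself is unchanged. Note that the /x comments only work when the pattern is laid out over several lines, since each `#` comment runs to the end of its line:

```perl
use strict;
use warnings;

# The pattern from the post, kept in a single-quoted q() string so that the
# backslashes reach the regex engine untouched.
my $r = q(
    <(\D+)>      # opening tag
    ([^/]*?)     # content that contains no '/'
    (<\/\1>)     # closing tag of the same type as the opening tag
);

# Hypothetical helper: collect [tag, content] for every match in one scan.
sub find_pairs {
    my ($text) = @_;
    my @found;
    while ( $text =~ m/$r/gix ) {
        push @found, [ $1, $2 ];
    }
    return @found;
}

for my $pair ( find_pairs('<COMMA><EXHIBITING></EXHIBITING></COMMA>') ) {
    print "$pair->[0] :: '$pair->[1]'\n";
}
```

On this input a single m//g scan reports only the EXHIBITING pair. The enclosing COMMA pair still contains a '/' in its middle, so it will only match if the processing step removes or rewrites the inner pair in the text first, as the substitution loop earlier in the thread does.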

Re: Regex priorities.
by Anonymous Monk on May 28, 2012 at 01:28 UTC

      Hi Anon,

      Thanks for this. I did in fact read most of the links you so kindly posted.

      I also thought it was a bit like a compiler problem and parsing was a potential solution, but I thought it would take longer. I was quite interested in how you would have parsed it.

      Regards

      Steve
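On the parsing question, one possible approach is a single left-to-right scan with a stack. This is only a sketch under stated assumptions, not what Anonymous Monk had in mind: the helper name `parse_inner_first` is made up, tag names are assumed to consist of ASCII letters only, and tags are compared case-insensitively to mirror the /i flag used in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: push each opening tag on a stack and report an element
# as soon as its closing tag is seen. Inner elements close first, so they are
# reported before the elements that contain them.
sub parse_inner_first {
    my ($text) = @_;
    my ( @stack, @done );
    # The capturing group makes split return the tags as well as the text
    # between them.
    for my $tok ( split /(<\/?[A-Za-z]+>)/, $text ) {
        if ( $tok =~ m/^<([A-Za-z]+)>\z/ ) {
            push @stack, { tag => uc($1), text => '' };
        }
        elsif ( $tok =~ m/^<\/([A-Za-z]+)>\z/ ) {
            my $name = uc($1);
            my $top  = pop @stack;
            die "closing tag $tok with nothing open\n" unless $top;
            die "expected </$top->{tag}> but found $tok\n"
                unless $top->{tag} eq $name;
            push @done, [ $top->{tag}, $top->{text} ];
        }
        elsif (@stack) {
            # Plain text belongs to the innermost open element; a stray '<'
            # that is not part of a tag ends up here too.
            $stack[-1]{text} .= $tok;
        }
    }
    die "unclosed <$stack[-1]{tag}>\n" if @stack;
    return @done;
}

my @elements = parse_inner_first(
    '<COMMA>pre-stuff<EXHIBITING>some <= stuff</EXHIBITING>post-stuff</COMMA>');
print "$_->[0] :: $_->[1]\n" for @elements;
```

Unlike the regex loop, this handles a '<' in the text, because only complete tags are treated as markup, and it reports mismatched or unclosed tags instead of silently matching nothing.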