Clarification on greediness

prasadbabu has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

I need some clarification on greediness.

The input file has the following text:

<pages>
<contentmodel>
<level>5</level>
<first>35</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>4</level>
<first>45</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>3</level>
<first>25</first>
<content>maps</content>
</contentmodel>
<contentmodel>
<level>2</level>
<first>15</first>
<content>some text</content>
</contentmodel>
</pages>

In the above input file I want to fetch the <first>\d+</first> tag in the contentmodel whose content tag has the text 'maps'.

I tried the following code using zero-width assertions, but it does not give the minimal match. It matches from the beginning of the first <first> tag.

(@arr) = $str =~ /(<first>((?!<first>).*?))<content>maps<\/content>/gsi;

I don't know where I am going wrong.

The output I need is:

<first>25</first>

Thanks in advance

Prasad

janitored by ybiC: Retitle from "greediness" because onewordnodetitles hinder site search


Replies are listed 'Best First'.
Re: Clarification on greediness by zejames (Hermit) on Nov 25, 2004 at 12:41 UTC
Just a little suggestion: why don't you use XML Perl modules, like XML::Simple, that will make your code clearer and easier to read and maintain?

-- zejames
Re^2: Clarification on greediness by thinker (Parson) on Nov 25, 2004 at 14:03 UTC
Hi prasadbabu

I have to agree with zejames that regexes are not the recommended way to parse XML. Especially when XML::Simple is so easy to use (just like it says on the tin).

In case you are not familiar with it, here is some sample code, to produce the output you are looking for.

#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
use XML::Simple;

my $xml = '<pages>
<contentmodel>
<level>5</level>
<first>35</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>4</level>
<first>45</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>3</level>
<first>25</first>
<content>maps</content>
</contentmodel>
<contentmodel>
<level>2</level>
<first>15</first>
<content>some text</content>
</contentmodel>
</pages>
';

# Parse the document into a nested hash/array structure.
my $ref = XMLin($xml);

# Print <first> for every contentmodel whose content is 'maps'.
for my $cm (@{ $ref->{'contentmodel'} }) {
    printf "<first>%d</first>\n", $cm->{'first'}
        if $cm->{'content'} eq 'maps';
}

__END__
output

<first>25</first>

cheers

thinker
Re: Clarification on greediness by rinceWind (Monsignor) on Nov 25, 2004 at 12:50 UTC
You might be interested in my answer Re: Why doens't non-greediness work?. A Super Search on "greed" will find this and other examples.

-- I'm Not Just Another Perl Hacker
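As a rough illustration of the behaviour this thread is about (a sketch written for this page, not taken from the linked node): a non-greedy quantifier changes where a match ends, not where it starts. The string and the names `$text`, `$lazy` and `$restricted` are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample string: three "a...": only the last one ends in "c".
my $text = 'a1b a2b a3c';

# Non-greedy: the match still begins at the leftmost "a" and
# simply grows until a "c" can finally be matched.
my ($lazy) = $text =~ /(a.*?c)/;

# Forbidding another "a" between the ends forces the engine
# to give up on the earlier starting points.
my ($restricted) = $text =~ /(a[^a]*c)/;

print "lazy:       $lazy\n";
print "restricted: $restricted\n";
```

Here `$lazy` ends up holding the whole string, while `$restricted` holds only `a3c`. The question's regex has the same problem with `<first>` in the role of `a`.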
Re: Clarification on greediness by gopalr (Priest) on Nov 25, 2004 at 12:45 UTC
You can try out this:

if ($file =~ m#(<first>(?:[^<]+|<(?!/?first>))+</first>)(?:[^<]+|<(?!/?first>))+(<content>maps</content>)#i) {
    print "\n$1 == $2";
}

Gopal.R
Re: Clarification on greediness by fglock (Vicar) on Nov 25, 2004 at 12:48 UTC
Make it step by step:

$str = join '' => <DATA>;
@arr =
    map  { m!(<first>.*?</first>)!si }                  # extract first
    grep { m!<content>maps</content>!si }               # filter by content
    $str =~ m!<contentmodel>(.*?)</contentmodel>!gsi;   # split on contentmodel
print @arr;
__DATA__
<pages>
<contentmodel>
<level>5</level>
<first>35</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>4</level>
<first>45</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>3</level>
<first>25</first>
<content>maps</content>
</contentmodel>
<contentmodel>
<level>2</level>
<first>15</first>
<content>some text</content>
</contentmodel>
</pages>
Re: Clarification on greediness by ikegami (Patriarch) on Nov 25, 2004 at 19:26 UTC
`/(<first>((?!<first>).*?))<content>maps<\/content>/` means: "<first>" not immediately followed by "<first>", followed by ...

It works if you move the paren:

`/(<first>((?!<first>).)*?)<content>maps<\/content>/`

The `?` is not necessary because of the negative lookahead, but I bet it's more efficient to leave it in (less backtracking).

There's also something screwy with your captures. Why do you have two? The following does more or less what you want:

$str =~ m%
   <first>                  # "<first>"
   ((?:(?!</first>).)*?)    # Capture the text in between.
   </first>                 # "</first>"
   (?:(?!<first>).)*?       # In the same record,
   <content>maps</content>  # "content" must be "maps".
%xgsi;

I said it "does more or less what you want" because...

What if the file contained a newline after `<content>`?
What if `<content>` was before `<first>`?
What if `maps` was written `aaps`?
...

Use an XML module. No, really, use an XML module.
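To see the effect of moving the paren, here is a minimal sketch that runs both patterns against the sample data from the question. It assumes the lazy quantifiers are `.*?` and `(...)*?` as in the question's regex. The variable names (`$str`, `$orig`, `$tempered`, `$n_orig`, `$n_tempered`) are illustrative, and the second pattern uses a non-capturing group `(?:...)` in place of the inner capture, a small tidy-up that is not part of the reply above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample input from the question, kept in a string for the demo.
my $str = <<'XML';
<pages>
<contentmodel>
<level>5</level>
<first>35</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>4</level>
<first>45</first>
<content>some text</content>
</contentmodel>
<contentmodel>
<level>3</level>
<first>25</first>
<content>maps</content>
</contentmodel>
<contentmodel>
<level>2</level>
<first>15</first>
<content>some text</content>
</contentmodel>
</pages>
XML

# Question's pattern: the lookahead is checked only once,
# immediately after the first "<first>".
my ($orig) = $str =~ /(<first>((?!<first>).*?))<content>maps<\/content>/si;

# Moved paren: the lookahead is checked before every character
# consumed, so the match can never run across another "<first>".
my ($tempered) = $str =~ /(<first>(?:(?!<first>).)*?)<content>maps<\/content>/si;

# Count how many "<first>" tags each match swallowed.
my $n_orig     = () = $orig     =~ /<first>/g;
my $n_tempered = () = $tempered =~ /<first>/g;

print "original pattern spans $n_orig <first> tag(s)\n";
print "moved-paren pattern spans $n_tempered <first> tag(s)\n";
print $tempered;
```

On this input the original pattern's match starts at `<first>35</first>` and spans three `<first>` tags, while the moved-paren version matches only `<first>25</first>` plus the newline that follows it.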

