PerlMonks  

Clarification on greediness

by prasadbabu (Prior)
on Nov 25, 2004 at 12:26 UTC ( [id://410378]=perlquestion )

prasadbabu has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks

I need some clarification on greediness.

The input file has the following text:

<pages>
  <contentmodel>
    <level>5</level>
    <first>35</first>
    <content>some text</content>
  </contentmodel>
  <contentmodel>
    <level>4</level>
    <first>45</first>
    <content>some text</content>
  </contentmodel>
  <contentmodel>
    <level>3</level>
    <first>25</first>
    <content>maps</content>
  </contentmodel>
  <contentmodel>
    <level>2</level>
    <first>15</first>
    <content>some text</content>
  </contentmodel>
</pages>

From the above input I want to fetch the <first>\d+</first> tag of the contentmodel whose <content> tag has the text 'maps'.

I tried the following code using zero-width assertions, but it does not find the minimal match. It matches from the beginning of the first <first> tag in the file.

(@arr) = $str =~ /(<first>((?!<first>).*?))<content>maps<\/content>/gsi;

I don't know where I am going wrong.

The output I need is:

<first>25</first>

Thanks in advance

Prasad

janitored by ybiC: Retitle from "greediness" because onewordnodetitles hinder site search

Replies are listed 'Best First'.
Re: Clarification on greediness
by zejames (Hermit) on Nov 25, 2004 at 12:41 UTC
    Just a little suggestion: why don't you use an XML Perl module, like XML::Simple? It will make your code clearer and easier to read and maintain.

    --
    zejames

      Hi prasadbabu

      I have to agree with zejames that regexes are not the recommended way to parse XML. Especially when XML::Simple is so easy to use (just like it says on the tin). In case you are not familiar with it, here is some sample code, to produce the output you are looking for.

      #!/usr/bin/perl
      use strict;
      use warnings FATAL => "all";
      use XML::Simple;

      my $xml = '<pages>
        <contentmodel>
          <level>5</level>
          <first>35</first>
          <content>some text</content>
        </contentmodel>
        <contentmodel>
          <level>4</level>
          <first>45</first>
          <content>some text</content>
        </contentmodel>
        <contentmodel>
          <level>3</level>
          <first>25</first>
          <content>maps</content>
        </contentmodel>
        <contentmodel>
          <level>2</level>
          <first>15</first>
          <content>some text</content>
        </contentmodel>
      </pages>
      ';

      my $ref = XMLin($xml);

      for my $cm (@{ $ref->{'contentmodel'} }) {
          printf "<first>%d</first>\n", $cm->{'first'}
              if $cm->{'content'} eq 'maps';
      }

      __END__
      output

      <first>25</first>

      cheers

      thinker

Re: Clarification on greediness
by rinceWind (Monsignor) on Nov 25, 2004 at 12:50 UTC
Re: Clarification on greediness
by gopalr (Priest) on Nov 25, 2004 at 12:45 UTC

    You can try out this:

    if ($file =~ m#(<first>(?:[^<]+|<(?!/?first>))+</first>)(?:[^<]+|<(?!/?first>))+(<content>maps</content>)#i) {
        print "\n$1 == $2";
    }

    Gopal.R

Re: Clarification on greediness
by fglock (Vicar) on Nov 25, 2004 at 12:48 UTC

    Make it step by step:

    $str = join '' => <DATA>;

    @arr =
        map  { m!(<first>.*?</first>)!si }                   # extract first
        grep { m!<content>maps</content>!si }                # filter by content
        $str =~ m!<contentmodel>(.*?)</contentmodel>!gsi;    # split on contentmodel

    print @arr;

    __DATA__
    <pages>
      <contentmodel>
        <level>5</level>
        <first>35</first>
        <content>some text</content>
      </contentmodel>
      <contentmodel>
        <level>4</level>
        <first>45</first>
        <content>some text</content>
      </contentmodel>
      <contentmodel>
        <level>3</level>
        <first>25</first>
        <content>maps</content>
      </contentmodel>
      <contentmodel>
        <level>2</level>
        <first>15</first>
        <content>some text</content>
      </contentmodel>
    </pages>

Re: Clarification on greediness
by ikegami (Patriarch) on Nov 25, 2004 at 19:26 UTC

    /(<first>((?!<first>).*?))<content>maps<\/content>/
    means: "<first>" not immediately followed by "<first>" followed by ...

    It works if you move the paren:
    /(<first>((?!<first>).)*?)<content>maps<\/content>/

    The ? is not necessary because of the negative lookahead, but I bet it's more efficient to leave it in (less backtracking).
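
A minimal sketch of the difference between the two patterns above. It assumes the question's data is held in $str, abbreviated here to three records to keep the example short. In the original pattern the lookahead is tested only once, directly after the first "<first>". With the paren moved, it is tested before every character the group consumes, so the match cannot run across a later "<first>".

```perl
use strict;
use warnings;

# Sample data from the question, abbreviated to three records
# (an assumption made to keep the sketch short).
my $str = join ' ',
    '<pages>',
    '<contentmodel> <level>5</level> <first>35</first> <content>some text</content> </contentmodel>',
    '<contentmodel> <level>3</level> <first>25</first> <content>maps</content> </contentmodel>',
    '<contentmodel> <level>2</level> <first>15</first> <content>some text</content> </contentmodel>',
    '</pages>';

# Original pattern: the lookahead is checked once, right after "<first>",
# and .*? is then free to expand across later <first> tags.
my ($orig) = $str =~ /(<first>((?!<first>).*?))<content>maps<\/content>/si;

# Paren moved: the lookahead is checked before each character consumed,
# so the match has to start at the last <first> before "maps".
my ($fixed) = $str =~ /(<first>((?!<first>).)*?)<content>maps<\/content>/si;

print "original: $orig\n";
print "fixed:    $fixed\n";
```

Note that $fixed still carries the whitespace between </first> and <content>, since the outer capture runs right up to <content>.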

    There's also something screwy with your captures. Why do you have two? The following does more or less what you want:

    $str =~ m%
       <first>                    # "<first>"
       ((?:(?!</first>).)*?)      # Capture the text in between.
       </first>                   # "</first>"
       (?:(?!<first>).)*?         # In the same record,
       <content>maps</content>    # "content" must be "maps".
    %xgsi;

    I said it "does more or less what you want" because...

    • What if the file contained a newline after <content>?
    • What if <content> was before <first>?
    • What if maps was written &#109;aps?
    • ...

    Use an XML module. No, really, use an XML module.
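
The caveats listed above can be checked directly. The sketch below runs the commented pattern from this reply (without /g, since each string is tested once) against four single-record strings. The strings are made up for illustration: one in the question's layout, one with a newline after <content>, one with <content> before <first>, and one with an entity-encoded spelling (&#109; is the character reference for "m"). The names %sample and %found are hypothetical.

```perl
use strict;
use warnings;

# The commented pattern from this reply, compiled once.
my $re = qr%
    <first>
    ((?:(?!</first>).)*?)       # capture the text in between
    </first>
    (?:(?!<first>).)*?          # stay within the same record
    <content>maps</content>     # "content" must be "maps"
%xsi;

# Hypothetical one-record inputs, all meaning the same thing to an XML parser
# apart from the leading newline in the second one.
my %sample = (
    plain   => '<contentmodel> <first>25</first> <content>maps</content> </contentmodel>',
    newline => "<contentmodel> <first>25</first> <content>\nmaps</content> </contentmodel>",
    swapped => '<contentmodel> <content>maps</content> <first>25</first> </contentmodel>',
    entity  => '<contentmodel> <first>25</first> <content>&#109;aps</content> </contentmodel>',
);

my %found;
for my $name (sort keys %sample) {
    # In list context a failed match returns an empty list, leaving $first undef.
    my ($first) = $sample{$name} =~ $re;
    $found{$name} = $first;
    printf "%-8s %s\n", $name, defined $first ? "matched $first" : 'no match';
}
```

Only the plain sample matches. The other three are missed by the regex, which is the argument for an XML module.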
