non-greedy piecewise matching

mifflin has asked for the wisdom of the Perl Monks concerning the following question:

I have some data files that were created without newlines that need to be fixed.
The files contain a bunch of records with an xml file name at the end.
They look like...

somedata file.xmlsomedata file.xmlsomedata file.xml ....

what I want them to be is ...

somedata file.xml
somedata file.xml
somedata file.xml
...

So, I thought I could use a piecewise regex like so...

 pos $data = 0;
 my $len = length $data;
 while (pos $data < $len) {
     if ( my ($line) = $data =~ m{ \G ( .+  \. xml ) }gcxms ) {
         print "$line\n";
     }
 }

The problem is I cannot figure out how to make the regex non-greedy. My capturing portion matches the full string, all the way to the last xml file. How do I change the regex to be non-greedy and match only up to the first xml file?


Replies are listed 'Best First'.
Re: non-greedy piecewise matching by ikegami (Patriarch) on Aug 02, 2007 at 17:40 UTC
The greediness is just your first problem.

Problem #2: You're using the `g` modifier in list context, causing all the matches to be returned at once. You'll never print anything other than the first file name. Fix (match in scalar context and use `$1`):

 pos $data = 0;
 my $len = length $data;
 while (pos $data < $len) {
     if ( $data =~ m{ \G ( .+? \. xml ) }gcxms ) {
         print "$1\n";
     }
 }

Problem #3: If there's anything after the last .xml, you have yourself an infinite loop. Checking whether `pos` is less than the length is a bad idea when using the `c` modifier, because a failed match leaves `pos` where it was. Fix:

 pos $data = 0;
 for (;;) {
     $data =~ m{ \G ( .+? \. xml ) }gcxms or last;
     print "$1\n";
 }

Finally: Using the `c` modifier is rather useless and ugly if you only have one regexp, and it's rather complex (as shown by the number of errors). Fix:

 while ( $data =~ m{ \G ( .+? \. xml ) }gxms ) {
     print "$1\n";
 }

Tip: If you really did have a use for `c` (e.g. if you were writing a lexer), then you'd have multiple regexps, and aliasing `$_` to the variable containing the text would be worthwhile.

 for ($data) {
     pos() = 0;
     for (;;) {
         /\G ... /xgc && do { ...; next };
         /\G ... /xgc && do { ...; next };
         /\G ... /xgc && do { ...; next };
         last;
     }
 }
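To make the list-versus-scalar point concrete, here is a minimal sketch. The sample string and the variable names `@all` and `@one_at_a_time` are made up for illustration and are not from the original post.

```perl
use strict;
use warnings;

# Hypothetical sample data: three records run together without newlines.
my $data = 'a one.xmlb two.xmlc three.xml';

# List context: m//g returns every capture in one go.
my @all = $data =~ m{ \G ( .+? \. xml ) }gxms;

# Scalar context: each evaluation finds one match and advances pos(),
# so the loop body runs once per record.
my @one_at_a_time;
while ( $data =~ m{ \G ( .+? \. xml ) }gxms ) {
    push @one_at_a_time, $1;
}

print "list:   $_\n" for @all;
print "scalar: $_\n" for @one_at_a_time;
```

Both arrays end up holding the same three records. The difference is that assigning the list-context result to `my ($line)` keeps only the first element, which is why the original loop printed a single file name.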
Re^2: non-greedy piecewise matching by mifflin (Curate) on Aug 02, 2007 at 19:12 UTC
Gads! Now I know why Damian put the following quote at the beginning of his chapter...

Some people, when confronted with a problem, think: "I know, I'll use regular expressions". Now they have two problems. -- Jamie Zawinski

Thanks.
Re: non-greedy piecewise matching by NetWallah (Canon) on Aug 02, 2007 at 16:56 UTC
The non-greedy version of "+" is "+?" (see perlreref).

"An undefined problem has an infinite number of solutions." - Robert A. Humphrey
"If you're not part of the solution, you're part of the precipitate." - Henry J. Tillman
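A minimal sketch of the difference between the two quantifiers, assuming a made-up sample string (the names `$greedy` and `$lazy` are illustrative only):

```perl
use strict;
use warnings;

my $data = 'aaa.xmlbbb.xmlccc.xml';   # hypothetical sample

# Greedy: .+ grabs as much as it can, then backs off to the LAST ".xml".
my ($greedy) = $data =~ m{ \A ( .+  \. xml ) }xms;

# Non-greedy: .+? grabs as little as it can, stopping at the FIRST ".xml".
my ($lazy)   = $data =~ m{ \A ( .+? \. xml ) }xms;

print "greedy: $greedy\n";   # greedy: aaa.xmlbbb.xmlccc.xml
print "lazy:   $lazy\n";     # lazy:   aaa.xml
```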
Re^2: non-greedy piecewise matching by mifflin (Curate) on Aug 02, 2007 at 17:24 UTC
thanks, that worked...

 > cat x
 my $data = 'jlasflsf.xmljlasjlkjlasjflsdf.xmlklajlajlsdfjkl.xml';
 while (pos $data < length $data) {
     if ( $data =~ m{ \G ( .+? \. xml) }gcxms ) {
         print "$1\n";
     }
 }
 > perl x
 jlasflsf.xml
 jlasjlkjlasjflsdf.xml
 klajlajlsdfjkl.xml
Re: non-greedy piecewise matching by FunkyMonk (Chancellor) on Aug 02, 2007 at 16:58 UTC
What's wrong with the much simpler `s/\.xml/.xml\n/g`?
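For completeness, a small sketch of that substitution applied to a copy of the data. The sample string and the name `$fixed` are assumptions for the example.

```perl
use strict;
use warnings;

my $data = 'aaa.xmlbbb.xmlccc.xml';   # hypothetical sample

# Copy $data into $fixed, then append a newline after every ".xml".
( my $fixed = $data ) =~ s/\.xml/.xml\n/g;

print $fixed;
```

Note that this puts a newline after every occurrence of ".xml", so it only suits data where ".xml" never appears inside a record.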
Re^2: non-greedy piecewise matching by mifflin (Curate) on Aug 02, 2007 at 17:02 UTC
Nothing. In fact, that's the way I did it, because I needed to get the files fixed now. I was just trying out piecewise matching because I've never done it before. I've been reading "Perl Best Practices" and was seeing if I could implement something like what was shown on pages 257-258.
Re: non-greedy piecewise matching by roboticus (Chancellor) on Aug 02, 2007 at 23:04 UTC
mifflin:

Howzabout:

 for (split /\.xml/,$data) {
     print $_ , ".xml\n";
 }

...roboticus
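One difference worth knowing about, sketched with a made-up input that has leftover text after the last file name (the names `@via_split` and `@via_match` are illustrative): the `split` approach tacks ".xml" onto the leftover, while a matching approach simply ignores it.

```perl
use strict;
use warnings;

# Hypothetical input with trailing text after the last ".xml".
my $data = 'aaa.xmlbbb.xmlleftover';

# split drops the separators, so ".xml" is glued back onto every piece,
# including the trailing piece that never had it.
my @via_split = map { "$_.xml" } split /\.xml/, $data;

# A non-greedy match only returns pieces that really end in ".xml".
my @via_match = $data =~ m{ ( .+? \. xml ) }gxms;

print "split: @via_split\n";   # split: aaa.xml bbb.xml leftover.xml
print "match: @via_match\n";   # match: aaa.xml bbb.xml
```

If every record is known to end in ".xml", as described in the question, the two approaches give the same result.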
Re: non-greedy piecewise matching by prasadbabu (Prior) on Aug 02, 2007 at 17:03 UTC
Hi mifflin,

You have to use '.+?' instead of '.+' to make the match non-greedy. Take a look at perlre.

As you said, if .xml is present after each record, then we can also use substitution or the split function.

 $file =~ s/(\.xml)/$1\n/g;

Prasad
