http://www.perlmonks.org?node_id=630328

mifflin has asked for the wisdom of the Perl Monks concerning the following question:

I have some data files that were created without newlines that need to be fixed.
The files contain a bunch of records with an xml file name at the end.
They look like...

somedata file.xmlsomedata file.xmlsomedata file.xml ....

What I want them to be is ...

somedata file.xml
somedata file.xml
somedata file.xml
...

So, I thought I could use a piecewise regex like so...
pos $data = 0;
my $len = length $data;
while (pos $data < $len) {
    if ( my ($line) = $data =~ m{ \G ( .+ \. xml ) }gcxms ) {
        print "$line\n";
    }
}
The problem is that I cannot figure out how to make the regex non-greedy. My capturing group matches the full string, all the way to the last xml file. How do I change the regex to be non-greedy and match only up to the first xml file?

Replies are listed 'Best First'.
Re: non-greedy piecewise matching
by ikegami (Patriarch) on Aug 02, 2007 at 17:40 UTC

    The greediness is just your first problem.

    Problem #2: You're using the g modifier in list context, causing all the matches to be returned at once. You'll never print anything other than the first file name.

    pos $data = 0;
    my $len = length $data;
    while (pos $data < $len) {
        if ( $data =~ m{ \G ( .+? \. xml ) }gcxms ) {
            print "$1\n";
        }
    }
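To see Problem #2 in isolation, here is a minimal sketch. The helper name all_captures and the sample string are made up for illustration; the regex is the one from the thread. It shows that a single list-context evaluation hands back every capture, so assigning to my ($line) keeps only the first.

```perl
use strict;
use warnings;

# Hypothetical helper: evaluate the thread's regex once, in list context.
sub all_captures {
    my ($data) = @_;
    pos($data) = 0;
    # List context + /g: one evaluation walks the whole string and
    # returns the capture from every successful match.
    my @all = $data =~ m{ \G ( .+? \. xml ) }gcxms;
    return @all;
}

my $sample = 'one.xmltwo.xmlthree.xml';    # made-up sample data

my @all    = all_captures($sample);
my ($line) = all_captures($sample);        # list assignment keeps the first

print scalar(@all), " captures from one evaluation\n";
print "first capture only: $line\n";
```

Dropping the my ($line) = ... assignment and reading $1 instead puts the match in scalar context, which is what the corrected loop does.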

    Problem #3: If there's anything after the last .xml, you have yourself an infinite loop. Checking whether pos is less than length is a bad idea when using the c modifier, because a failed match then leaves pos where it was. Fix:

    pos $data = 0;
    for (;;) {
        $data =~ m{ \G ( .+? \. xml ) }gcxms
            or last;
        print "$1\n";
    }
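A small sketch of why the pos-versus-length test can spin forever. The helper name pos_after_failed_match and the sample string are assumptions for illustration. It runs the match until it fails and reports where pos ends up, with and without the c modifier.

```perl
use strict;
use warnings;

# Hypothetical helper: match until failure, then report pos().
sub pos_after_failed_match {
    my ($data, $use_c) = @_;
    pos($data) = 0;
    if ($use_c) {
        # With /c, a failed match leaves pos() untouched.
        1 while $data =~ m{ \G .+? \. xml }gcxms;
    }
    else {
        # Without /c, a failed match resets pos() to undef.
        1 while $data =~ m{ \G .+? \. xml }gxms;
    }
    return pos $data;
}

my $sample = 'one.xmltrailing';    # made-up data with junk after the last .xml

my $with_c    = pos_after_failed_match($sample, 1);
my $without_c = pos_after_failed_match($sample, 0);

print "with /c:    ", ( defined $with_c    ? $with_c    : 'undef' ), "\n";
print "without /c: ", ( defined $without_c ? $without_c : 'undef' ), "\n";
```

With /c the position stays at the end of the last successful match, which is still less than the string length, so a loop guarded by pos $data < $len never ends once the regex stops matching.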

    Finally: Using the c modifier is rather useless and ugly if you only have one regexp, and it's rather complex (as shown by the number of errors). Fix:

    while ( $data =~ m{ \G ( .+? \. xml ) }gxms ) {
        print "$1\n";
    }
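For anyone who wants to run the final version, here is a self-contained sketch. The wrapper name split_records and the sample data are made up; the loop itself is the one above.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the final loop: returns one record per
# ".xml" file name found in the run-together string.
sub split_records {
    my ($data) = @_;
    my @records;
    # Scalar-context //g moves pos($data) forward after each match, and
    # \G anchors the next attempt exactly where the previous one ended.
    while ( $data =~ m{ \G ( .+? \. xml ) }gxms ) {
        push @records, $1;
    }
    return @records;
}

# Made-up sample data in the shape described in the question.
print "$_\n" for split_records('aaa one.xmlbbb two.xmlccc three.xml');
```

Anything left over after the last .xml is silently dropped, because the loop simply ends at the first failed match.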

    Tip: If you really did have a use for c (e.g. if you were writing a lexer), then you'd have multiple regexps, and aliasing $_ to the variable containing the text would be worthwhile.

    for ($data) {
        pos() = 0;
        for (;;) {
            /\G ... /xgc && do { ...; next };
            /\G ... /xgc && do { ...; next };
            /\G ... /xgc && do { ...; next };
            last;
        }
    }
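To make the lexer tip concrete, here is a rough sketch with the elided parts filled in by invented rules. The function name tokenize, the token names NUM, IDENT and OP, and the toy grammar are all assumptions, not anything from the thread. Only the loop skeleton follows the outline above.

```perl
use strict;
use warnings;

# Hypothetical toy lexer: numbers, identifiers and a few operators.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    for ($text) {                # alias $_ to the text being scanned
        pos($_) = 0;
        for (;;) {
            # Each rule tries to match at the current position. Thanks
            # to /c, a failed rule leaves pos() alone for the next rule.
            /\G \s+ /xgc               && do { next };    # skip whitespace
            /\G (\d+) /xgc             && do { push @tokens, [ NUM   => $1 ]; next };
            /\G ([A-Za-z_]\w*) /xgc    && do { push @tokens, [ IDENT => $1 ]; next };
            /\G ([-+*\/=]) /xgc        && do { push @tokens, [ OP    => $1 ]; next };
            last;                # no rule matched: end of input or bad character
        }
    }
    return @tokens;
}

# Made-up input.
print "$_->[0]: $_->[1]\n" for tokenize('total = 42 + y1');
```

This is where the c modifier earns its keep: several patterns take turns at the same position, so a failure in one must not reset pos for the others.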
      gads!
      Now I know why the Damian put the following quote at the beginning of his chapter...

      Some people, when confronted with a problem, think:
      "I know, I'll use regular expressions".
      Now they have two problems.
      -- Jamie Zawinski

      Thanks.

Re: non-greedy piecewise matching
by NetWallah (Canon) on Aug 02, 2007 at 16:56 UTC
    The non-greedy version of "+" is "+?" (see perlreref).
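A quick side-by-side sketch of the two quantifiers. The helper names greedy_match and lazy_match and the sample string are invented for illustration.

```perl
use strict;
use warnings;

# Greedy: .+ takes as much as it can, then backs off to the LAST ".xml".
sub greedy_match {
    my ($data) = @_;
    my ($hit) = $data =~ m{ ( .+ \. xml ) }xms;
    return $hit;
}

# Non-greedy: .+? takes as little as it can, stopping at the FIRST ".xml".
sub lazy_match {
    my ($data) = @_;
    my ($hit) = $data =~ m{ ( .+? \. xml ) }xms;
    return $hit;
}

my $sample = 'one.xmltwo.xmlthree.xml';    # made-up sample data

print greedy_match($sample), "\n";    # one.xmltwo.xmlthree.xml
print lazy_match($sample),   "\n";    # one.xml
```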

         "An undefined problem has an infinite number of solutions." - Robert A. Humphrey
         "If you're not part of the solution, you're part of the precipitate." - Henry J. Tillman

      thanks, that worked...
      > cat x
      my $data = 'jlasflsf.xmljlasjlkjlasjflsdf.xmlklajlajlsdfjkl.xml';
      while (pos $data < length $data) {
          if ( $data =~ m{ \G ( .+? \. xml) }gcxms ) {
              print "$1\n";
          }
      }
      > perl x
      jlasflsf.xml
      jlasjlkjlasjflsdf.xml
      klajlajlsdfjkl.xml
Re: non-greedy piecewise matching
by FunkyMonk (Chancellor) on Aug 02, 2007 at 16:58 UTC

    What's wrong with the much simpler s/\.xml/.xml\n/g?

      Nothing. In fact, that's the way I did it, because I needed to get the files fixed now. I was just trying out piecewise matching because I've never done it before.
      I've been reading "Perl Best Practices" and was seeing if I could implement something like what was shown on pages 257-258.
Re: non-greedy piecewise matching
by roboticus (Chancellor) on Aug 02, 2007 at 23:04 UTC
Re: non-greedy piecewise matching
by prasadbabu (Prior) on Aug 02, 2007 at 17:03 UTC

    Hi mifflin,

    You have to use '.+?' instead of '.+' to make the match non-greedy. Take a look at perlre.

    As you said, if .xml is present after each record, then we can also use substitution or the split function.

    $file =~ s/(\.xml)/$1\n/g;

    Prasad
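For comparison, here is a sketch of the two simpler approaches mentioned in this thread, substitution and split. The helper names with_substitution and with_split and the sample data are made up. The split pattern, a lookbehind for ".xml", is one possible choice rather than anything a poster wrote. Both assume ".xml" only ever appears at the end of a record.

```perl
use strict;
use warnings;

# Substitution: put a newline after every ".xml".
sub with_substitution {
    my ($data) = @_;
    $data =~ s/(\.xml)/$1\n/g;
    return $data;
}

# Split: cut at every position that is immediately preceded by ".xml".
# The zero-width lookbehind keeps the ".xml" text in each piece.
sub with_split {
    my ($data) = @_;
    return split /(?<=\.xml)/, $data;
}

my $sample = 'aaa one.xmlbbb two.xmlccc three.xml';    # made-up sample data

print with_substitution($sample);
print "$_\n" for with_split($sample);
```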