raygun has asked for the wisdom of the Perl Monks concerning the following question:

O wise monks, this is probably one of those questions that has a simple, obvious answer that I'm just not seeing because I've been looking at it too long.

I have the following input:

mono-basic-2.10
mono-2.10.2-r1
mono-2.10.5

I need to split each line into a package name and a version string. Because the hyphen may occur in either part, I need a regex that is smart enough to figure out which hyphen to split on.

This regex seems to work:

foreach $line (<STDIN>) {
    chomp $line;
    $line =~ s/-(?=[^-]+(-r[0-9]+)?$)/ ==> /;
    print "$line\n";
}
in that it turns the correct hyphen into a more obvious separator, outputting:
mono-basic   ==>   2.10
mono   ==>   2.10.2-r1
mono   ==>   2.10.5
However, if I use the exact same regex in the split function:
foreach $line (<STDIN>) {
    chomp $line;
    ($package, $ver) = split /-(?=[^-]+(-r[0-9]+)?$)/, $line;
    print "$package ==> $ver\n";
}
it doesn't work:
mono-basic   ==>   
mono   ==>   -r1
mono   ==>   
Giving split a LIMIT of 2 doesn't change the output; the actual version numbers are eaten, and only the trailing -r1 on the second line makes it into $ver. What is it about split's processing of the regex that is different from that of the substitute operator?

$ perl --version

This is perl 5, version 12, subversion 3 (v5.12.3) built for i686-linux
(with 13 registered patches, see perl -V for more detail)

Re: regex behaves differently in split vs substitute?
by roboticus (Chancellor) on Oct 07, 2011 at 23:43 UTC


    Re-read the perldoc -f split documentation. Since you have capturing parentheses in your regex, split is inserting the captured values into the list of values it returns. When the '-' delimiter matches but the optional group doesn't take part in the match, split still inserts a field for the group, and that field is undefined.

    $ cat 930260.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    while (my $line = <DATA>) {
        chomp $line;
        my @flds = split /-(?=[^-]+(-r[0-9]+)?$)/, $line;
        print join("|", @flds),">\n";
    }
    __DATA__
    mono-basic-2.10
    mono-2.10.2-r1
    mono-2.10.5

    $ perl 930260.pl
    Use of uninitialized value $flds[1] in join or string at 930260.pl line 9, <DATA> line 1.
    mono-basic||2.10>
    Use of uninitialized value $flds[3] in join or string at 930260.pl line 9, <DATA> line 2.
    mono|-r1|2.10.2||r1>
    Use of uninitialized value $flds[1] in join or string at 930260.pl line 9, <DATA> line 3.
    mono||2.10.5>
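One way to act on this, offered as a minimal sketch rather than the only fix: make the optional group non-capturing with (?:...), so split has no captures to insert, and pass a LIMIT of 2 so the later hyphen in "-r1" is left alone. The helper name split_pkg is made up for illustration; the regex is otherwise the one from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: split "package-version" on the hyphen that starts
# the version. The optional -rN suffix is grouped with (?:...), which does
# not capture, so split returns only the two real fields.
sub split_pkg {
    my ($line) = @_;
    # LIMIT 2 stops after the first matching hyphen; without it the hyphen
    # in front of "r1" would also qualify as a separator.
    return split /-(?=[^-]+(?:-r[0-9]+)?$)/, $line, 2;
}

for my $line (qw(mono-basic-2.10 mono-2.10.2-r1 mono-2.10.5)) {
    my ($package, $ver) = split_pkg($line);
    print "$package ==> $ver\n";
}
```

With the capturing group gone, LIMIT behaves the way the original post expected, because the only fields returned are the pieces between separators.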


    When your only tool is a hammer, all problems look like your thumb.

Re: regex behaves differently in split vs substitute?
by Marshall (Canon) on Oct 08, 2011 at 02:43 UTC
    Aside from the issues of how split works, it appears to me that you have a situation that is better suited to a regex match (or a global regex match) than to split.

    -In general, use split when you know what to throw away and that "throw away" separator is an easy-to-identify sequence in the input.
    -Use a regex match when you know what you want to keep and you can either (a) write one regex that describes all the "hunks" that you want or (b) enumerate the patterns easily.
    -Sometimes the techniques are best combined, and that leads to more complicated regex patterns in the split. As a performance note, in many of my benchmarks, a regex match/match global is faster than using a split. A complex regex in a split burdens the "slower but simple" split with something complicated.
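The first two rules of thumb can be sketched side by side. This is only an illustration: the helper names and the sample strings are invented for the example, not taken from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: the separator (a comma plus optional spaces) is the
# easy thing to describe, so throw it away with split.
sub fields_by_split {
    my ($text) = @_;
    return split /\s*,\s*/, $text;
}

# Hypothetical helper: the dotted version numbers are the easy thing to
# describe, so keep them with a global match in list context.
sub versions_by_match {
    my ($text) = @_;
    return $text =~ /(\d+(?:\.\d+)+)/g;
}

print join('|', fields_by_split('mono, mono-basic ,mono-tools')), "\n";
print join('|', versions_by_match('mono-2.10.2-r1 mono-2.10.5')), "\n";
```

In the first case the pattern describes the junk between fields; in the second it describes the fields themselves and everything else is ignored.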

    It looks to me like you want to "split" at the first "-" that comes before a number, and that really means that a regex match solution is in order rather than a split.

    There are other regex solutions - I don't claim that this is the best, but I do recommend trying to formulate a single forward pass regex (no look ahead or look behind) wherever possible because it will typically be the fastest.

    #!/usr/bin/perl -w
    use strict;

    while (<DATA>)
    {
        next if /^\s*$/;   # skip blank lines
        # package: letters and hyphens; version: everything after the next hyphen
        my ($package,$ver) = /^\s*([a-zA-Z-]+)-(.+)\s*$/;
        printf "%-15s %s\n", $package,$ver;
    }
    =prints
    mono-basic      2.10
    mono            2.10.2-r1
    mono            2.10.5
    =cut
    __DATA__
    mono-basic-2.10
    mono-2.10.2-r1
    mono-2.10.5
    Update: if you want to know if the regex succeeded, just check if $ver is defined or not. If $ver is defined, then $package will be also. Oh, there is no need to chomp() because the \s*$ will match and throw the trailing \n character(s) away. And oh, the regex substitution operation is very slow, relative to just "match and capture" because the data has to be copied to "make room" for the new characters - a "substitute and then split" strategy will be slow.
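The "check whether $ver is defined" advice might look like the sketch below. The wrapper name parse_line and the non-matching sample line are assumptions added for illustration; the regex is the one from this reply.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the match-and-capture regex from this reply.
# Returns (package, version) on success, or an empty list when the line
# does not match.
sub parse_line {
    my ($line) = @_;
    my ($package, $ver) = $line =~ /^\s*([a-zA-Z-]+)-(.+)\s*$/;
    return defined $ver ? ($package, $ver) : ();
}

for my $line ("mono-basic-2.10\n", "not a package line\n") {
    # A list assignment in scalar context yields the number of values
    # returned, so an empty list makes the condition false.
    if (my ($package, $ver) = parse_line($line)) {
        printf "%-15s %s\n", $package, $ver;
    }
    else {
        chomp(my $shown = $line);
        warn "could not parse: $shown\n";
    }
}
```

No chomp is needed on the matching path, since the trailing \s*$ absorbs the newline.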
Re: regex behaves differently in split vs substitute?
by leslie (Pilgrim) on Oct 08, 2011 at 05:33 UTC

    You can use the code below to extract the version:

    use strict;
    use warnings;

    while (my $line = <DATA>) {
        chomp $line;
        if ($line =~ /^[a-z-]+(\d.*)$/) {
            print ">>$1<<\n";
        }
    }
    __DATA__
    mono-basic-2.10
    mono-2.10.2-r1
    mono-2.10.5
      That's fine, but I recommend avoiding $1, $2, etc. If you put the left-hand-side in a list context, a variable like $version can be assigned directly without fiddling with $1 as an intermediary. For most folks, $version is easier to understand than just $1.

      Your if() statement is correct; a match returns a true or false value. However, with an assignment like the one below, $version ends up either defined or undefined, and that can also be tested in an "if".

      chomp if you like, but adding \s*$ includes \n in the regex (no need for chomp). chomp is "not expensive", but once we whip out the nuclear weapon of regex, asking it to throw away any trailing white space is no big deal.

      use strict;
      use warnings;

      while (my $line = <DATA>) {
          my ($version) = $line =~ /^[a-z-]+(\d.*)\s*$/;
          print ">>$version<<\n" if $version;
      }
      =PRINTS:
      >>2.10<<
      >>2.10.2-r1<<
      >>2.10.5<<
      =cut
      __DATA__
      mono-basic-2.10
      mono-2.10.2-r1
      mono-2.10.5

        Eeeeeeewwwwww :P

        #!/usr/bin/perl --
        use strict;
        use warnings;

        my $dita = <<'__DITA__';
        mono-basic-2.10
        mono-2.10.2-r1
        mono-2.10.5
        __DITA__

        open my $data => '<', \$dita or die $!;
        while( my $line = <$data> ){
            if( my ($version) = $line =~ /^[a-z-]+(\d.*)\s*$/ ){
                print ">>$version<<\n"
            }
        }
        __END__
        >>2.10<<
        >>2.10.2-r1<<
        >>2.10.5<<