http://www.perlmonks.org?node_id=731138

puterboy has asked for the wisdom of the Perl Monks concerning the following question:

Specifically, I want to "split" a string based on a (variable) regexp I am passing, *provided* that the match is less than N characters long.

If it weren't in a split I could in general do "conjunction" by using two separate matches tied together with an "&&".

If I knew the regexp in advance, I could potentially refashion it to include the length restriction intrinsically.

But I don't and I'm at a loss on how or whether I can do this within a single "/PATTERN/" expression as required by split.

And I would prefer to avoid any of the "highly experimental" options in perlre since I don't want to wake up one morning after an upgrade only to find that my code no longer works (and not knowing why...)
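For reference, the two-step "&&" conjunction mentioned in the question looks roughly like this outside of split. This is only a sketch: the helper name short_match, the pattern qr/,+/ and the limit of 3 are made up for illustration and do not come from the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first match of $re in $string if it is
# shorter than $limit characters, or undef otherwise.
sub short_match {
    my ( $string, $re, $limit ) = @_;
    return ( $string =~ /($re)/ && length($1) < $limit ) ? $1 : undef;
}

for my $string ( 'a,,b', 'a,,,,,b' ) {
    my $hit = short_match( $string, qr/,+/, 3 );
    printf "%s: %s\n", $string, defined $hit ? "accepted '$hit'" : 'rejected';
}
```

The difficulty is that split takes only a single pattern, so the length test has no second expression to live in.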

Re: Can you do "conjunctive" (overlapping) conditions in a single regexp?
by tilly (Archbishop) on Dec 18, 2008 at 03:00 UTC
    perlre describes 2 experimental code features that can do this, albeit in an ugly way:
    @x = split /(?{
            # This populates $^R
            pos()
        })some pattern(??{
            # This returns an impossible pattern for long matches
            (pos()-$^R < 5) ? "" : "no\\bmatch"
        })/, $string;
    Be warned though that this will backtrack and try to match again. For instance if you want to not match more than 4 ,'s, this will match 4 out of 6 commas, then will match 2 commas next time. So instead of preserving 6 commas you'll eat them up and get a blank field.
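The comma case described above can be tried with a small sketch like the following. The pattern ,+ stands in for "some pattern", and the sub name split_capped is made up for illustration; the rest follows the first snippet.

```perl
use strict;
use warnings;

# Split on runs of commas, but only accept a separator that is shorter
# than 5 characters. Longer runs make the (??{ }) block return a pattern
# that cannot match, so the engine backtracks to a shorter run.
sub split_capped {
    my ($string) = @_;
    return split /(?{ pos() }),+(??{ (pos() - $^R < 5) ? "" : "no\\bmatch" })/,
        $string;
}

my @fields = split_capped('a,,,,,,b');    # six commas in a row
print scalar(@fields), " fields: ", join( '|', @fields ), "\n";
```

If the engine behaves as described, the six commas should be consumed as a run of four followed by a run of two, leaving an empty field between "a" and "b" rather than keeping the commas.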

    If this is not what you want then you have to do a lot more work. You have to make sure that it fails on all of the backtracks as well. This means that when you fail you need to remember the failure and fail every time you see that position. Like this:

    # Be sure that these start clean!
    my %bad_r;
    my %bad_pos;
    @x = split /(?{
            # This populates $^R
            pos()
        })some pattern(??{
            # This returns an impossible pattern for long matches
            if (pos()-$^R < 3 and not $bad_r{$^R} and not $bad_pos{pos()} ) {
                "";
            }
            else {
                $bad_r{$^R}++;
                $bad_pos{ pos() }++;
                "no\\bmatch"
            }
        })/, $string;
    That is ugly! But it should work.

    Update: Well it should if I had not missed the closing paren on the pos(). Thanks to ikegami for catching that.

      Nice. The first snippet is exactly what I would have done, except I'd replace
      (??{ (pos()-$^R < 5) ? "" : "no\\bmatch" })
      with
      (?(?{ pos()-$^R >= 5 })(?!))
      to save from compiling patterns repeatedly.

        I missed that possibility in the documentation. That would be more efficient.
      Thanks, that's very clever (probably too clever for my little brain ;). I'm trying, though, to see whether I can use case #1 rather than your much hairier second example.

      I actually want to preserve the field and keep it with the split, so prior to adding the length limitation, I was using:

      split /(?=$regex)/, $string
      And I want to get only the greediest match (that satisfies the length condition). So does that mean I can't use the first alternative?
        You can't use the first version.

        You can use the second version like you did before. Just put the (?=) around the whole thing and insert whatever you want for the pattern in the middle.

        It is ugly but conceptually is not that bad. The first code pattern stores the position of the start of the match in $^R. The second one looks at the current position and the start (which is in $^R) and decides whether or not to make the match fail by interpolating in something that can't match. There are some complications around the logic, but that doesn't need to change.

Re: Can you do "conjunctive" (overlapping) conditions in a single regexp?
by ikegami (Patriarch) on Dec 18, 2008 at 02:58 UTC
    local our $limit = 10;
    /($re)(?(?{ length($^N) > $limit })(?!))/

    It will even backtrack until the limit isn't exceeded unless you do something like

    /((?>$re))(?(?{ length($^N) > $limit })(?!))/

    The features aren't nearly as experimental as implied. After all, they survived three major versions (5.6, 5.8 and 5.10)! $^N requires 5.8, but you can use $1 if you want compatibility with 5.6.

    Update: Oops, forgot that this needed to be for split. Well, the comment on the experimental features still applies.
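To see the difference between the two patterns in this reply, a sketch along these lines may help. The sub name limited_match, its $atomic flag and the sample data are assumptions for illustration, and $re is assumed to be a qr// object without code blocks of its own.

```perl
use strict;
use warnings;

our $limit;    # package variable, visible inside the regex code block

# Returns (start offset, matched text) of the first acceptable match,
# or an empty list if there is none.
sub limited_match {
    my ( $string, $re, $max, $atomic ) = @_;
    local $limit = $max;
    my $ok = $atomic
        ? $string =~ /((?>$re))(?(?{ length($^N) > $limit })(?!))/
        : $string =~ /($re)(?(?{ length($^N) > $limit })(?!))/;
    return $ok ? ( $-[0], $1 ) : ();
}

my @plain  = limited_match( 'a,,,,,b', qr/,+/, 3, 0 );
my @atomic = limited_match( 'a,,,,,b', qr/,+/, 3, 1 );
print "plain:  @plain\n";     # plain:  1 ,,,
print "atomic: @atomic\n";    # atomic: 3 ,,,
```

Without the atomic group, the engine backtracks inside the run of five commas and accepts the first three. With it, backtracking inside the run is blocked, but the engine still moves the start position forward until the remaining run fits, so the overlong run is not rejected as a whole. Preventing that is what the %bad_r and %bad_pos bookkeeping in tilly's second snippet is for.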

Re: Can you do "conjunctive" (overlapping) conditions in a single regexp?
by JadeNB (Chaplain) on Dec 18, 2008 at 21:48 UTC
    It's not an answer directly to the question that you asked, but couldn't you manually massage the results after splitting?
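    my @clumsy_split = split /($re)/, $string;
    my @split;
    my $flag = 0;
    for my $i ( 0 .. $#clumsy_split ) {
        my $field = $clumsy_split[$i];
        if ( $flag ) {
            # Field following an over-long separator: glue it back on
            $flag = 0;
        }
        elsif ( $i % 2 == 0 || length $field < $limit ) {
            # Even indices are fields, odd indices are separators
            push @split, $field;
            next;
        }
        else {
            # Over-long separator: glue it and the next field back on
            $flag = 1;
        }
        $split[-1] .= $field;
    }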
    This won't give exactly the same results as a hypothetical split ( /($re)/ && length $1 < $limit ), $string. For example, the results differ if $re is /a+/, $limit is 1, and $string is 'aa'. But maybe it's close enough.
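A variation on the same post-processing idea, wrapped in a sub and returning only the fields (as split /$re/ would, rather than split /($re)/), might look like this. The name split_short_only and the sample data are made up for illustration, $re is assumed to contain no capturing groups of its own, and the same caveat about greedy matches applies.

```perl
use strict;
use warnings;

# Hypothetical helper: behaves like split /$re/, except that a separator
# of $limit or more characters is glued back into the surrounding field.
sub split_short_only {
    my ( $string, $re, $limit ) = @_;
    # With a capturing group, split returns field, sep, field, sep, ...
    # The -1 keeps trailing empty fields so the alternation stays intact.
    my @pieces = split /($re)/, $string, -1;
    return () unless @pieces;
    my @fields = ( shift @pieces );
    while (@pieces) {
        my ( $sep, $field ) = splice @pieces, 0, 2;
        if ( length($sep) < $limit ) {
            push @fields, $field;            # short separator: real split
        }
        else {
            $fields[-1] .= $sep . $field;    # long separator: undo the split
        }
    }
    return @fields;
}

print join( '|', split_short_only( 'a,b,,,,c,d', qr/,+/, 3 ) ), "\n";
# prints: a|b,,,,c|d
```

Because of the -1 limit, trailing empty fields are kept here, which a plain split would drop. This uses no experimental regex features at all.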