ThreeMonks has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks. I have a vocabulary and sentence, and want to split the sentence into words in the vocabulary. For example,
my @vocabulary = qw(a abc abcd abd bc); my $sentence = 'abdaabc';
How to get the list (abd, a, abc)? I tried the following regexes. Each of them match the sentence, but none of them gives the right answer.
my $pattern = join '|', @vocabulary; $sentence =~ /^($pattern)+$/; print $1; my @list1; $sentence =~ /^(($pattern)(?{push @list1, $^N}))+$/; print (join ",", @list1), "\n"; use Regexp::DeferredExecution; my @list2; $sentence =~ /^(($pattern)(?{push @list2, $^N}))+$/; print (join ",", @list2), "\n"
In the first and the third cases, only the last matched word get printed, while in the second, a list much longer than expected.

Replies are listed 'Best First'.
Re: Split a sentence into words
by ikegami (Pope) on May 30, 2009 at 07:05 UTC

    Don't use my variables declared outside the regex pattern from within (?{}).

    The problem you are having is that one of the patterns matches, then gets added to @list1, then gets unmatched by backtracking. But you never remove it from @list1 on backtracking. A simple example of this:

    >perl -le"'abc1def2' =~ /(?:([a-z])(?{ print $^N }))+2/" a b c b c c d e f

    The solution is to use $^R.

    use strict; use warnings; my @vocabulary = qw( a abc abcd abd bc ); my $sentence = 'abdaabc'; my ($pattern) = map qr/$_/, join '|', map quotemeta, sort { length($b) <=> length($a) } # optional @vocabulary; use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ @list = @{$^R} }) /x or die("No solution\n"); print( join('-', @list), "\n" ); # abd-a-abc

    Without the sort, you'd get abd-a-a-bc. If you want all possible solutions:

    ... use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ push @list, join('-', @{$^R}) }) (?!) /x; die("No solution\n") if !@list; print("$_\n") for @list;
    abd-a-a-bc abd-a-abc
Re: Split a sentence into words
by akho (Hermit) on May 30, 2009 at 04:45 UTC
    my @vocabulary = qw(abd abcd abc a bc); my $sentence = 'abdaabc'; my $pattern = join '|', @vocabulary; my @words = $sentence =~ /($pattern)/g;

    note that @vocabulary has to be sorted in such a way that "longer" words come earlier; i.e. if word x is a prefix of word y, word y must come earlier in the list.

    Upd Does not actually work; i.e. it works only for some vocabularies; say (abcd, abc, de) will not split 'abcde' right. Things get complicated and computer-sciencey. See bart and ikegami's replies below.

      note that @vocabulary has to be sorted in such a way that "longer" words come earlier
      If you depend on a module like Regex::PreSuf, not only will it have the same effect, i.e. matching the longest match possible, but likely, it'll match faster, at least for longer lists, and pre 5.10 perl.
Re: Split a sentence into words
by bart (Canon) on May 30, 2009 at 12:24 UTC
    That reminds me of Abigail's (in)famous "determine if a number is prime using a regex" hack. Where I think that akho's Re: Split a sentence into words falls short, is that I think you ought to be using Perl's backtracking mechanism to make it fit anyway. It's not because parts match, that it'll match as a whole.

    Here is a dumb, straightforward approach:

    # your data my @vocabulary = qw(a abc abcd abd bc); my $sentence = 'abdaabc'; # regex for words my $re = join '|', @vocabulary; # does it match in *any* way my $success = $sentence =~ /^(?:$re)*$/; # show result print $success || 0;
    That displays a rather disappointing yet for a start, encouraging result:
    Okay, so it matches, but we have no idea how. We should find a way to mark where it matches, and where it backtracks.

    But after that, I'm lost. I've tried (??{code()) in various combinations, trying to capture intermediate state of the regex, including $^R, pos, @- and @+, but I don't get any intuitive results... maybe somebody else can take it up from here?

    The only other thing that yields interesting results, is

    use re 'debug';
    but it's not something you can use in a script.