Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Split a sentence into words

by ThreeMonks (Novice)
on May 30, 2009 at 04:33 UTC ( #767007=perlquestion: print w/ replies, xml ) Need Help??
ThreeMonks has asked for the wisdom of the Perl Monks concerning the following question:

Hi, monks. I have a vocabulary and sentence, and want to split the sentence into words in the vocabulary. For example,
my @vocabulary = qw(a abc abcd abd bc); my $sentence = 'abdaabc';
How to get the list (abd, a, abc)? I tried the following regexes. Each of them match the sentence, but none of them gives the right answer.
my $pattern = join '|', @vocabulary; $sentence =~ /^($pattern)+$/; print $1; my @list1; $sentence =~ /^(($pattern)(?{push @list1, $^N}))+$/; print (join ",", @list1), "\n"; use Regexp::DeferredExecution; my @list2; $sentence =~ /^(($pattern)(?{push @list2, $^N}))+$/; print (join ",", @list2), "\n"
In the first and the third cases, only the last matched word get printed, while in the second, a list much longer than expected.

Comment on Split a sentence into words
Select or Download Code
Re: Split a sentence into words
by akho (Hermit) on May 30, 2009 at 04:45 UTC
    my @vocabulary = qw(abd abcd abc a bc); my $sentence = 'abdaabc'; my $pattern = join '|', @vocabulary; my @words = $sentence =~ /($pattern)/g;

    note that @vocabulary has to be sorted in such a way that "longer" words come earlier; i.e. if word x is a prefix of word y, word y must come earlier in the list.

    Upd Does not actually work; i.e. it works only for some vocabularies; say (abcd, abc, de) will not split 'abcde' right. Things get complicated and computer-sciencey. See bart and ikegami's replies below.

      note that @vocabulary has to be sorted in such a way that "longer" words come earlier
      If you depend on a module like Regex::PreSuf, not only will it have the same effect, i.e. matching the longest match possible, but likely, it'll match faster, at least for longer lists, and pre 5.10 perl.
Re: Split a sentence into words
by ikegami (Pope) on May 30, 2009 at 07:05 UTC

    Don't use my variables declared outside the regex pattern from within (?{}).

    The problem you are having is that one of the patterns matches, then gets added to @list1, then gets unmatched by backtracking. But you never remove it from @list1 on backtracking. A simple example of this:

    >perl -le"'abc1def2' =~ /(?:([a-z])(?{ print $^N }))+2/" a b c b c c d e f

    The solution is to use $^R.

    use strict; use warnings; my @vocabulary = qw( a abc abcd abd bc ); my $sentence = 'abdaabc'; my ($pattern) = map qr/$_/, join '|', map quotemeta, sort { length($b) <=> length($a) } # optional @vocabulary; use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ @list = @{$^R} }) /x or die("No solution\n"); print( join('-', @list), "\n" ); # abd-a-abc

    Without the sort, you'd get abd-a-a-bc. If you want all possible solutions:

    ... use re 'eval'; local our @list; $sentence =~ / (?{ [] }) ^ (?: ($pattern) (?{ [ @{$^R}, $^N ] }) )+ $ (?{ push @list, join('-', @{$^R}) }) (?!) /x; die("No solution\n") if !@list; print("$_\n") for @list;
    abd-a-a-bc abd-a-abc
Re: Split a sentence into words
by bart (Canon) on May 30, 2009 at 12:24 UTC
    That reminds me of Abigail's (in)famous "determine if a number is prime using a regex" hack. Where I think that akho's Re: Split a sentence into words falls short, is that I think you ought to be using Perl's backtracking mechanism to make it fit anyway. It's not because parts match, that it'll match as a whole.

    Here is a dumb, straightforward approach:

    # your data my @vocabulary = qw(a abc abcd abd bc); my $sentence = 'abdaabc'; # regex for words my $re = join '|', @vocabulary; # does it match in *any* way my $success = $sentence =~ /^(?:$re)*$/; # show result print $success || 0;
    That displays a rather disappointing yet for a start, encouraging result:
    1
    Okay, so it matches, but we have no idea how. We should find a way to mark where it matches, and where it backtracks.

    But after that, I'm lost. I've tried (??{code()) in various combinations, trying to capture intermediate state of the regex, including $^R, pos, @- and @+, but I don't get any intuitive results... maybe somebody else can take it up from here?

    The only other thing that yields interesting results, is

    use re 'debug';
    but it's not something you can use in a script.

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: perlquestion [id://767007]
Approved by akho
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (12)
As of 2015-07-06 17:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    The top three priorities of my open tasks are (in descending order of likelihood to be worked on) ...









    Results (77 votes), past polls