<?xml version="1.0" encoding="windows-1252"?>
<node id="491884" title="combining (?(condition)yes|no) and (?{code})" created="2005-09-14 10:35:06" updated="2005-09-14 06:35:06">
<type id="11">
note</type>
<author id="249603">
halley</author>
<data>
<field name="doctext">
&lt;p&gt;It took me a bit to get the syntax correct, but two special regular-expression features can be used in combination:  &lt;c&gt;(?(condition)yes|no)&lt;/c&gt; and &lt;c&gt;(?{code})&lt;/c&gt;.  The [perldoc://perlre] page explains that they can be combined, but gives no example.  In the example below, I also use the &lt;c&gt;(?!pattern)&lt;/c&gt; construct with an empty pattern, which always fails, to force a backtrack for each non-word.

&lt;p&gt;Here's my example.  

&lt;code&gt;
use strict;
use warnings;

my %vocab = map { $_ =&gt; 1 }
            qw/one two three four
               five six seven eight
               nine/;

my $text = "onetwoeightxfour";

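# Capture the shortest run of word characters, then run Perl code to test it:
# if the capture is not a known word, (?!) always fails and forces a backtrack
# (which lengthens the capture); otherwise the empty lookahead (?=) succeeds.
my $finder = qr/
                (\w+?)
                (?(?{ not exists $vocab{$1} })
                  (?!) | (?=) )
               /x;

# In list context, m//g returns every captured word.
for ($text =~ m/$finder/g)
{
    print "$_\n";
}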
&lt;/code&gt;
Output:
&lt;code&gt;
one
two
eight
four
&lt;/code&gt;

&lt;p&gt;This particular solution is non-greedy:  it finds the shortest known word and leaves the rest for future matches.  A more complicated solution would try harder to consume more letters for an early word if that led to fewer unmatched letters in the long run:  "bekindtostewardessesplease" should find 'stewardesses', not 'stew'.  Luckily, one possible solution is simple:  change &lt;c&gt;(\w+?)&lt;/c&gt; to &lt;c&gt;(\w+)&lt;/c&gt;, so that each match takes the longest known word at its starting position, and be patient with the engine as it chugs through the additional backtracking work.
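
&lt;p&gt;Here is a rough sketch of that greedy variant.  The vocabulary is made up for this illustration; the pattern is the one from the example above with only the quantifier changed.

&lt;code&gt;
use strict;
use warnings;

# Hypothetical vocabulary, chosen only to exercise the stew/stewardesses case.
my %vocab = map { ($_, 1) }
            qw/be kind to stew stewardesses please/;

my $text = "bekindtostewardessesplease";

# Greedy (\w+): the engine starts from the longest candidate and
# backtracks, one letter at a time, until the capture is a known word.
my $finder = qr/
                (\w+)
                (?(?{ not exists $vocab{$1} })
                  (?!) | (?=) )
               /x;

my @words = $text =~ m/$finder/g;
print "@words\n";
&lt;/code&gt;

With this vocabulary it should print &lt;c&gt;be kind to stewardesses please&lt;/c&gt;.  Note that this is only longest-match-first at each position; it does not look ahead to see whether a shorter early word would leave fewer unmatched letters later.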

&lt;p&gt;Of course, you can fill the vocabulary hash with whatever you want, or use different code in the &lt;c&gt;(?{code})&lt;/c&gt; construct to achieve the solution.  You can also replace the &lt;c&gt;(?=)&lt;/c&gt; success case to deal with extra unknown letters between words.
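
&lt;p&gt;As one small illustration of using different code in the &lt;c&gt;(?{code})&lt;/c&gt; construct, the sketch below folds case before the lookup.  The &lt;c&gt;lc&lt;/c&gt; call, the vocabulary and the sample text are my own assumptions for the sake of the example, not the only way to do it.

&lt;code&gt;
use strict;
use warnings;

# Hypothetical lower-case vocabulary and mixed-case input.
my %vocab = map { ($_, 1) } qw/one two three/;

my $text = "OneTWOthree";

# Same structure as before; only the embedded code changes:
# the capture is lower-cased before it is looked up.
my $finder = qr/
                (\w+?)
                (?(?{ not exists $vocab{lc $1} })
                  (?!) | (?=) )
               /x;

my @words = $text =~ m/$finder/g;
print "@words\n";
&lt;/code&gt;

This should print &lt;c&gt;One TWO three&lt;/c&gt;:  the words are matched regardless of case, but captured as they appear in the text.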

&lt;div class="pmsig"&gt;&lt;div class="pmsig-249603"&gt;
&lt;p&gt;--&lt;br&gt;&lt;tt&gt;&amp;#91; e d @ h a l l e y . c c &amp;#93;&lt;/tt&gt;

&lt;/div&gt;&lt;/div&gt;</field>
<field name="root_node">
491875</field>
<field name="parent_node">
491875</field>
</data>
</node>
