in reply to Unicode substitution regex conundrum
"use utf8;" is needed only if your source code itself is UTF-8 encoded. It tells perl how to read the literals in your program; it does not affect how regexes match your data.
Perl does not have UTF-8 semantics; it has Unicode semantics. The difference is that in Perl you work on *encodingless* strings of characters, and you use the *normal* operators on them instead of a separate set. The encoding details are handled internally.
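To make "encodingless" concrete, here is a minimal sketch using only the core Encode module. The sample string and the variable names are mine, purely for illustration: bytes are decoded once on the way in, the normal operators then work per character, and the result is encoded once on the way out.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a file or socket: "café" encoded as UTF-8.
my $bytes = "caf\xC3\xA9";

# Decode once at the input boundary; $text is now a string of characters.
my $text = decode('UTF-8', $bytes);

# 5 bytes went in, 4 characters came out.
printf "bytes: %d, characters: %d\n", length($bytes), length($text);

# The normal operators work per character: uc() handles the decoded
# U+00E9, and \w matches it as well.
my $upper   = uc $text;
my $is_word = $text =~ /\A\w+\z/ ? 1 : 0;

# Encode again only at the output boundary.
my $out = encode('UTF-8', $upper);
print "$out\n";
```

Nothing here needs "use utf8;", because the source file itself is plain ASCII.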
Please read the Perl Unicode Tutorial and the Perl Unicode FAQ.
Unicode::Semantics works around a bug that causes the second half of latin1 (characters 128 to 255) to be ignored under certain circumstances.
The following ought to suffice:
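use Encode qw(decode);
use Unicode::Semantics qw(up);
# Decode the UTF-8 bytes into a character string, then pass it through up().
up($line = decode 'UTF-8', $line);
# A word that is not one of the boolean operators (case-insensitive).
my $word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;
# Two passes, because /g matches cannot overlap: in "a b c" the first pass
# joins a and b, and the second joins b and c.
$line =~ s/($word)\s*($word)/$1 AND $2/g for 1..2;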
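For anyone who wants to try the substitution without installing anything, here is a rough self-contained sketch. The wrapper sub add_implicit_and and the sample inputs are hypothetical, and I am assuming that the core utf8::upgrade is an adequate stand-in for up() in this particular case.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A word that is not one of the boolean operators (case-insensitive).
my $word = qr/\b(?!(?:AND|OR|XOR|NOT)\b)\w+/i;

# Hypothetical helper: takes UTF-8 bytes, returns UTF-8 bytes.
sub add_implicit_and {
    my ($bytes) = @_;
    my $line = decode('UTF-8', $bytes);
    utf8::upgrade($line);    # assumption: core stand-in for up()
    # Two passes, because /g matches cannot overlap.
    $line =~ s/($word)\s*($word)/$1 AND $2/g for 1..2;
    return encode('UTF-8', $line);
}

print add_implicit_and("foo bar baz"), "\n";       # foo AND bar AND baz
print add_implicit_and("foo OR bar baz"), "\n";    # foo OR bar AND baz

# Accented words ("café thé NOT lait") are treated as whole words too.
print add_implicit_and("caf\xC3\xA9 th\xC3\xA9 NOT lait"), "\n";
```

Existing operators are left alone, and AND is inserted only between two adjacent plain words.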