Here's an approach based on regexes that might be of interest. It has some problems (addressed in Update), and it might be better to pursue the design of a parser for this application. I don't think it should be much of a problem to compile Unicode regexes, although I must admit I have no experience in this area.
The problems I see (at the moment) are:
-
Stop word prefixes and suffixes are analyzed entirely independently, so that core in the (nonsense) word encoregent (last line of test text below) is not translated even though the prefix en and suffix gent are not associated with each other in the stop word list. Update: In fact, encore stops translation of core.
(Actually, now I think of it, this aspect of the regex could probably be fixed fairly easily, but at the cost of even slower speed. Maybe update later... See Update.)
-
A word like cores (last line of test text) must be specified separately for translation, with its own stop word list. (Uncomment the specification for cores in the @trans array to see the effect.)
-
A certain amount of confusion is introduced by the fact that the test text uses spaces to delimit words, and spaces, if I correctly understand the OP, would be essentially absent in the actual text. However, I couldn't get my mind around a total absence of whitespace!
-
I wouldn't be surprised to find that this approach is as slow as molasses in January (hey...!) for translating hundreds of thousands or millions of words.
-
If there is a very large number of words (and stop words) to translate, the final regex ($translate below) will be monstrous.
-
Probably some more I (and others) will think of later...
Anyhoo, FWIW, here it is.
use warnings;
use strict;
use List::MoreUtils qw(uniq);
my @trans = (
# substitute... for... except in...
[ 'TWENTY', 'score', qw(fourscore scored scores) ],
[ 'CENTER', 'core', qw(encore encores coregent score) ],
# [ 'CENTERS', 'cores', qw(encores scores) ],
[ 'JOHN', 'Johann', qw(Johannesburg) ],
[ 'CENTER', 'centre', ],
[ 'TRAVELED', 'travelled', ],
[ 'HAS NOT', 'hasn\'t', ],
);
my %trans = map @{ $_ }[1, 0], @trans;
my ($translate) =
map qr{ $_ }xms,
join ' | ',
map word_regex(@{ $_ }[1 .. $#{$_}]),
@trans
;
while (defined(my $line = <DATA>)) {
print $line;
$line =~ s{ ($translate) }{$trans{$1}}xmsg;
print $line;
print "\n";
}
sub word_regex {
my ($word,
@stops,
) = @_;
my $not_prefix = # 5. conjunction
join ') (?<! ', # 4. neg. look-behind...
uniq # 3. no dups...
map m{ \A (.+) $word }xms, # 2. extract prefixes...
@stops # 1. for any stop words...
;
$not_prefix = "(?<! $not_prefix)" if $not_prefix; # final wrap
my $not_suffix =
join ' | ',
uniq
map m{ $word (.+) \z }xms,
@stops
;
$not_suffix = "(?! $not_suffix)" if $not_suffix;
return qr{ $not_prefix $word $not_suffix }xms;
}
__DATA__
the core of the coregents encores scored fourscore score of
scores when Johann travelled to Johannesburg
for a score of cores hasn't a centre encoregent
Output:
>perl selective_trans_1.pl
the core of the coregents encores scored fourscore score of
the CENTER of the coregents encores scored fourscore TWENTY of
scores when Johann travelled to Johannesburg
scores when JOHN TRAVELED to Johannesburg
for a score of cores hasn't a centre encoregent
for a TWENTY of cores HAS NOT a CENTER encoregent
Update: A more-better version (replaces previous, less-better version). It addresses item 1 in the problem list above, and sorting by size as Polyglot suggests (apparently per tye) addresses item 2. It demonstrates the use of regexes in specifying the stop word list. (And expanding the function definitions inline would also produce the one-liner Polyglot wants. Win-win!) Although not what Polyglot wants or needs, I think this is a neat approach. I enjoyed working this problem, so thanks to Polyglot.
use warnings;
use strict;
my @translate = (
# insert... for... except in...
[ 'TWENTY', 'score', qw(twoscore unscored? score[srd]) ],
[ 'CENTER', 'core', qw(encore[sd]? score[sd]? core[srd]) ],
[ 'CENTERS', 'cores', qw(encores scores) ],
[ 'JOHN', 'Johann', qw(Johannesburg) ],
[ 'CENTER', 'centre', ],
[ 'TRAVELED', 'travelled', ],
[ 'HAS NOT', 'hasn\'t', ],
);
my %replace = map @{ $_ }[1, 0], @translate;
my $search =
join ' | ',
map word_regex(@{ $_ }[1 .. $#{$_}]),
sort { $b->[1] cmp $a->[1] } # longest words first
@translate
;
while (defined(my $line = <DATA>)) {
print $line;
$line =~ s{ ($search) }{$replace{$1}}xmsg;
print $line;
print "\n";
}
sub word_regex {
my ($word,
@stops,
) = @_;
my $not_stopped =
join ' ',
map not_stopped(@$_),
map [ m{ \A (.*) ($word) (.*) \z }xms ],
@stops
;
return "$not_stopped $word";
}
sub not_stopped {
my ($stop_prefix, # always defined if word defined, maybe empty
$word, # word embedded in stop word
$stop_suffix, # always defined if word defined, maybe empty
) = @_;
return '' unless defined $word and length $word;
# convert word to placeholder (faster match?)
$word = sprintf '.{%d}', length $word;
# convert stop prefix, if any, to POSITIVE assertion.
$stop_prefix = "(?<= $stop_prefix)" if length $stop_prefix;
# NEGATIVE assert of stop prefix, word placeholder, stop suffix.
return "(?! $stop_prefix $word $stop_suffix)";
}
__DATA__
a score and twoscore of unscored scorer won't unscore scored scores
at the core of the encore that was scored and cored
for many scores of encores Johann travelled to Johannesburg
for a score of cores hasn't a stercorean centre
Output:
>perl selective_trans_1.pl
a score and twoscore of unscored scorer won't unscore scored scores
a TWENTY and twoscore of unscored scorer won't unscore scored scores
at the core of the encore that was scored and cored
at the CENTER of the encore that was scored and cored
for many scores of encores Johann travelled to Johannesburg
for many scores of encores JOHN TRAVELED to Johannesburg
for a score of cores hasn't a stercorean centre
for a TWENTY of CENTERS HAS NOT a sterCENTERan CENTER
-
Are you posting in the right place? Check out Where do I post X? to know for sure.
-
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big>
<blockquote> <br /> <dd>
<dl> <dt> <em> <font>
<h1> <h2> <h3> <h4>
<h5> <h6> <hr /> <i>
<li> <nbsp> <ol> <p>
<small> <strike> <strong>
<sub> <sup> <table>
<td> <th> <tr> <tt>
<u> <ul>
-
Snippets of code should be wrapped in
<code> tags not
<pre> tags. In fact, <pre>
tags should generally be avoided. If they must
be used, extreme care should be
taken to ensure that their contents do not
have long lines (<70 chars), in order to prevent
horizontal scrolling (and possible janitor
intervention).
-
Want more info? How to link
or How to display code and escape characters
are good places to start.