Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Re: Efficient selective substitution on list of words

by AnomalousMonk (Abbot)
on Jan 31, 2010 at 08:18 UTC ( #820558=note: print w/ replies, xml ) Need Help??


in reply to Efficient selective substitution on list of words

Here's an approach based on regexes that might be of interest. It has some problems (addressed in Update), and it might be better to pursue the design of a parser for this application. I don't think it should be much of a problem to compile Unicode regexes, although I must admit I have no experience in this area.

The problems I see (at the moment) are:

  1. Stop word prefixes and suffixes are analyzed entirely independently, so that core in the (nonsense) word encoregent (last line of test text below) is not translated even though the prefix en and suffix gent are not associated with each other in the stop word list. Update: In fact, encore stops translation of core. (Actually, now I think of it, this aspect of the regex could probably be fixed fairly easily, but at the cost of even slower speed. Maybe update later... See Update.)
  2. A word like cores (last line of test text) must be specified separately for translation, with its own stop word list. (Uncomment the specification for cores in the @trans array to see the effect.)
  3. A certain amount of confusion is introduced by the fact that the test text uses spaces to delimit words, and spaces, if I correctly understand the OP, would be essentially absent in the actual text. However, I couldn't get my mind around a total absence of whitespace!
  4. I wouldn't be surprised to find that this approach is as slow as molasses in January (hey...!) for translating hundreds of thousands or millions of words.
  5. If there is a very large number of words (and stop words) to translate, the final regex ($translate below) will be monstrous.
  6. Probably some more I (and others) will think of later...

Anyhoo, FWIW, here it is.

use warnings; use strict; use List::MoreUtils qw(uniq); my @trans = ( # substitute... for... except in... [ 'TWENTY', 'score', qw(fourscore scored scores) ], [ 'CENTER', 'core', qw(encore encores coregent score) ], # [ 'CENTERS', 'cores', qw(encores scores) ], [ 'JOHN', 'Johann', qw(Johannesburg) ], [ 'CENTER', 'centre', ], [ 'TRAVELED', 'travelled', ], [ 'HAS NOT', 'hasn\'t', ], ); my %trans = map @{ $_ }[1, 0], @trans; my ($translate) = map qr{ $_ }xms, join ' | ', map word_regex(@{ $_ }[1 .. $#{$_}]), @trans ; while (defined(my $line = <DATA>)) { print $line; $line =~ s{ ($translate) }{$trans{$1}}xmsg; print $line; print "\n"; } sub word_regex { my ($word, @stops, ) = @_; my $not_prefix = # 5. conjunction join ') (?<! ', # 4. neg. look-behind... uniq # 3. no dups... map m{ \A (.+) $word }xms, # 2. extract prefixes... @stops # 1. for any stop words... ; $not_prefix = "(?<! $not_prefix)" if $not_prefix; # final wrap my $not_suffix = join ' | ', uniq map m{ $word (.+) \z }xms, @stops ; $not_suffix = "(?! $not_suffix)" if $not_suffix; return qr{ $not_prefix $word $not_suffix }xms; } __DATA__ the core of the coregents encores scored fourscore score of scores when Johann travelled to Johannesburg for a score of cores hasn't a centre encoregent

Output:

>perl selective_trans_1.pl the core of the coregents encores scored fourscore score of the CENTER of the coregents encores scored fourscore TWENTY of scores when Johann travelled to Johannesburg scores when JOHN TRAVELED to Johannesburg for a score of cores hasn't a centre encoregent for a TWENTY of cores HAS NOT a CENTER encoregent

Update: A more-better version (replaces previous, less-better version). It addresses item 1 in the problem list above, and sorting by size as Polyglot suggests (apparently per tye) addresses item 2. It demonstrates the use of regexes in specifying the stop word list. (And expanding the function definitions inline would also produce the one-liner Polyglot wants. Win-win!) Although not what Polyglot wants or needs, I think this is a neat approach. I enjoyed working this problem, so thanks to Polyglot.

use warnings; use strict; my @translate = ( # insert... for... except in... [ 'TWENTY', 'score', qw(twoscore unscored? score[srd]) ], [ 'CENTER', 'core', qw(encore[sd]? score[sd]? core[srd]) ], [ 'CENTERS', 'cores', qw(encores scores) ], [ 'JOHN', 'Johann', qw(Johannesburg) ], [ 'CENTER', 'centre', ], [ 'TRAVELED', 'travelled', ], [ 'HAS NOT', 'hasn\'t', ], ); my %replace = map @{ $_ }[1, 0], @translate; my $search = join ' | ', map word_regex(@{ $_ }[1 .. $#{$_}]), sort { $b->[1] cmp $a->[1] } # longest words first @translate ; while (defined(my $line = <DATA>)) { print $line; $line =~ s{ ($search) }{$replace{$1}}xmsg; print $line; print "\n"; } sub word_regex { my ($word, @stops, ) = @_; my $not_stopped = join ' ', map not_stopped(@$_), map [ m{ \A (.*) ($word) (.*) \z }xms ], @stops ; return "$not_stopped $word"; } sub not_stopped { my ($stop_prefix, # always defined if word defined, maybe empty $word, # word embedded in stop word $stop_suffix, # always defined if word defined, maybe empty ) = @_; return '' unless defined $word and length $word; # convert word to placeholder (faster match?) $word = sprintf '.{%d}', length $word; # convert stop prefix, if any, to POSITIVE assertion. $stop_prefix = "(?<= $stop_prefix)" if length $stop_prefix; # NEGATIVE assert of stop prefix, word placeholder, stop suffix. return "(?! $stop_prefix $word $stop_suffix)"; } __DATA__ a score and twoscore of unscored scorer won't unscore scored scores at the core of the encore that was scored and cored for many scores of encores Johann travelled to Johannesburg for a score of cores hasn't a stercorean centre

Output:

>perl selective_trans_1.pl a score and twoscore of unscored scorer won't unscore scored scores a TWENTY and twoscore of unscored scorer won't unscore scored scores at the core of the encore that was scored and cored at the CENTER of the encore that was scored and cored for many scores of encores Johann travelled to Johannesburg for many scores of encores JOHN TRAVELED to Johannesburg for a score of cores hasn't a stercorean centre for a TWENTY of CENTERS HAS NOT a sterCENTERan CENTER


Comment on Re: Efficient selective substitution on list of words
Select or Download Code
Re^2: Efficient selective substitution on list of words
by Polyglot (Monk) on Jan 31, 2010 at 13:56 UTC
    I really do appreciate your suggestions, and while it may be more detailed overall, the charm of tye's solution was that he basically prepared the entire substitution list in a single line of code, using map, sort, and join with some interesting syntax and I was wanting to learn how it worked for the beauty of learning as much as for the joy of having a working solution.

    In the sort process, I remember he ordered the words by length so that the substitutions would start with the longest words and work down to the shorter ones. This would avoid several potential problems with overlapping terms in the exchange list, and is almost a necessity, I think, to proper function here.

    Blessings,

    ~Polyglot~

Log In?
Username:
Password:

What's my password?
Create A New User
Node Status?
node history
Node Type: note [id://820558]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (6)
As of 2014-09-22 07:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    How do you remember the number of days in each month?











    Results (182 votes), past polls