Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

Comment on

( #3333=superdoc: print w/ replies, xml ) Need Help??

Here's an approach based on regexes that might be of interest. It has some problems (addressed in Update), and it might be better to pursue the design of a parser for this application. I don't think it should be much of a problem to compile Unicode regexes, although I must admit I have no experience in this area.

The problems I see (at the moment) are:

  1. Stop word prefixes and suffixes are analyzed entirely independently, so that core in the (nonsense) word encoregent (last line of test text below) is not translated even though the prefix en and suffix gent are not associated with each other in the stop word list. Update: In fact, encore stops translation of core. (Actually, now I think of it, this aspect of the regex could probably be fixed fairly easily, but at the cost of even slower speed. Maybe update later... See Update.)
  2. A word like cores (last line of test text) must be specified separately for translation, with its own stop word list. (Uncomment the specification for cores in the @trans array to see the effect.)
  3. A certain amount of confusion is introduced by the fact that the test text uses spaces to delimit words, and spaces, if I correctly understand the OP, would be essentially absent in the actual text. However, I couldn't get my mind around a total absence of whitespace!
  4. I wouldn't be surprised to find that this approach is as slow as molasses in January (hey...!) for translating hundreds of thousands or millions of words.
  5. If there is a very large number of words (and stop words) to translate, the final regex ($translate below) will be monstrous.
  6. Probably some more I (and others) will think of later...

Anyhoo, FWIW, here it is.

use warnings; use strict; use List::MoreUtils qw(uniq); my @trans = ( # substitute... for... except in... [ 'TWENTY', 'score', qw(fourscore scored scores) ], [ 'CENTER', 'core', qw(encore encores coregent score) ], # [ 'CENTERS', 'cores', qw(encores scores) ], [ 'JOHN', 'Johann', qw(Johannesburg) ], [ 'CENTER', 'centre', ], [ 'TRAVELED', 'travelled', ], [ 'HAS NOT', 'hasn\'t', ], ); my %trans = map @{ $_ }[1, 0], @trans; my ($translate) = map qr{ $_ }xms, join ' | ', map word_regex(@{ $_ }[1 .. $#{$_}]), @trans ; while (defined(my $line = <DATA>)) { print $line; $line =~ s{ ($translate) }{$trans{$1}}xmsg; print $line; print "\n"; } sub word_regex { my ($word, @stops, ) = @_; my $not_prefix = # 5. conjunction join ') (?<! ', # 4. neg. look-behind... uniq # 3. no dups... map m{ \A (.+) $word }xms, # 2. extract prefixes... @stops # 1. for any stop words... ; $not_prefix = "(?<! $not_prefix)" if $not_prefix; # final wrap my $not_suffix = join ' | ', uniq map m{ $word (.+) \z }xms, @stops ; $not_suffix = "(?! $not_suffix)" if $not_suffix; return qr{ $not_prefix $word $not_suffix }xms; } __DATA__ the core of the coregents encores scored fourscore score of scores when Johann travelled to Johannesburg for a score of cores hasn't a centre encoregent

Output:

>perl selective_trans_1.pl the core of the coregents encores scored fourscore score of the CENTER of the coregents encores scored fourscore TWENTY of scores when Johann travelled to Johannesburg scores when JOHN TRAVELED to Johannesburg for a score of cores hasn't a centre encoregent for a TWENTY of cores HAS NOT a CENTER encoregent

Update: A more-better version (replaces previous, less-better version). It addresses item 1 in the problem list above, and sorting by size as Polyglot suggests (apparently per tye) addresses item 2. It demonstrates the use of regexes in specifying the stop word list. (And expanding the function definitions inline would also produce the one-liner Polyglot wants. Win-win!) Although not what Polyglot wants or needs, I think this is a neat approach. I enjoyed working this problem, so thanks to Polyglot.

use warnings; use strict; my @translate = ( # insert... for... except in... [ 'TWENTY', 'score', qw(twoscore unscored? score[srd]) ], [ 'CENTER', 'core', qw(encore[sd]? score[sd]? core[srd]) ], [ 'CENTERS', 'cores', qw(encores scores) ], [ 'JOHN', 'Johann', qw(Johannesburg) ], [ 'CENTER', 'centre', ], [ 'TRAVELED', 'travelled', ], [ 'HAS NOT', 'hasn\'t', ], ); my %replace = map @{ $_ }[1, 0], @translate; my $search = join ' | ', map word_regex(@{ $_ }[1 .. $#{$_}]), sort { $b->[1] cmp $a->[1] } # longest words first @translate ; while (defined(my $line = <DATA>)) { print $line; $line =~ s{ ($search) }{$replace{$1}}xmsg; print $line; print "\n"; } sub word_regex { my ($word, @stops, ) = @_; my $not_stopped = join ' ', map not_stopped(@$_), map [ m{ \A (.*) ($word) (.*) \z }xms ], @stops ; return "$not_stopped $word"; } sub not_stopped { my ($stop_prefix, # always defined if word defined, maybe empty $word, # word embedded in stop word $stop_suffix, # always defined if word defined, maybe empty ) = @_; return '' unless defined $word and length $word; # convert word to placeholder (faster match?) $word = sprintf '.{%d}', length $word; # convert stop prefix, if any, to POSITIVE assertion. $stop_prefix = "(?<= $stop_prefix)" if length $stop_prefix; # NEGATIVE assert of stop prefix, word placeholder, stop suffix. return "(?! $stop_prefix $word $stop_suffix)"; } __DATA__ a score and twoscore of unscored scorer won't unscore scored scores at the core of the encore that was scored and cored for many scores of encores Johann travelled to Johannesburg for a score of cores hasn't a stercorean centre

Output:

>perl selective_trans_1.pl a score and twoscore of unscored scorer won't unscore scored scores a TWENTY and twoscore of unscored scorer won't unscore scored scores at the core of the encore that was scored and cored at the CENTER of the encore that was scored and cored for many scores of encores Johann travelled to Johannesburg for many scores of encores JOHN TRAVELED to Johannesburg for a score of cores hasn't a stercorean centre for a TWENTY of CENTERS HAS NOT a sterCENTERan CENTER

In reply to Re: Efficient selective substitution on list of words by AnomalousMonk
in thread Efficient selective substitution on list of words by Polyglot

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • Outside of code tags, you may need to use entities for some characters:
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.
  • Log In?
    Username:
    Password:

    What's my password?
    Create A New User
    Chatterbox?
    and the web crawler heard nothing...

    How do I use this? | Other CB clients
    Other Users?
    Others chanting in the Monastery: (8)
    As of 2014-07-24 06:32 GMT
    Sections?
    Information?
    Find Nodes?
    Leftovers?
      Voting Booth?

      My favorite superfluous repetitious redundant duplicative phrase is:









      Results (158 votes), past polls