Efficient selective substitution on list of words

by Polyglot (Monk)
on Jan 31, 2010 at 02:48 UTC
Polyglot has asked for the wisdom of the Perl Monks concerning the following question:

Greetings to all,

I asked in the chat window several days ago about how to accomplish this, and tye provided me a good answer using map and sort. Unfortunately, my laptop crashed shortly thereafter, and I lost his answer. (That'll teach me, ha!) However, there are a couple of complicating factors that tye may not have addressed even then, and I'm looking for wisdom on a succinct and safe way of accomplishing this.

Here's what I have:

  1. A file containing a tab-delimited list of words to exchange for their modern spellings/equivalents, with a third column listing any stopwords within which no substitution should be made.
  2. A file containing a list of files in which substitutions must be made.
  3. Over a hundred such files needing to be updated.
  4. The target language is Asian, where 1) there are no spaces between words; and 2) the encoding will be UTF-8. (This is significant, because any regexp must be sensitive to this, or it will fail.)
Here's an "English-ised" example of the words list file:

WORD        REPLACEMENT   STOPWORDS
score       twenty        fourscore,scored,scores
core        center        encore,coregent
centre      center
travelled   traveled
hasn't      has not
Johann      John          Johannesburg

So, what I need to do is substitute each word in the first column for the word(s) in the second column, except where the word in the stopwords column is matched. While this seems like a simple scenario, I'm struggling to wrap my brain around it. I'm just beginning to grasp the concepts of map and join, and their syntax, but would much appreciate some ideas for how to accomplish this.
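As a starting point, here is one possible sketch of loading the three-column list into two hashes, assuming the columns are tab-delimited as described. The rows below are hypothetical stand-ins for lines read from the words file:

```perl
use strict;
use warnings;

# Hypothetical rows standing in for lines of the (tab-delimited) words file.
my @rows = (
    "score\ttwenty\tfourscore,scored,scores",
    "centre\tcenter",
);

my (%replace, %stopwords);
for my $row (@rows) {
    my ($word, $repl, $stops) = split /\t/, $row;
    next unless defined $repl and length $repl;       # skip malformed rows
    $replace{$word}   = $repl;
    $stopwords{$word} = [ split /,/, (defined $stops ? $stops : '') ];
}

print "$_ => $replace{$_} (stop: @{ $stopwords{$_} })\n" for sort keys %replace;
```

From here the question is how to apply %replace to the text while honoring %stopwords.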

Blessings,

~Polyglot~

Re: Efficient selective substitution on list of words
by repellent (Priest) on Jan 31, 2010 at 05:56 UTC
    my %words = (
        score => {
            replacement => 'twenty',
            stopwords   => { fourscore => 1, scored => 1, scores => 1 },
        },
        core => {
            replacement => 'center',
            stopwords   => { encore => 1, coregent => 1 },
        },
        centre => {
            replacement => 'center',
            stopwords   => { },
        },
        ...
    );

    If you have the data structure above, it should give you enough to construct the logic needed to achieve your task efficiently.
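    One possible way to drive that structure (a sketch, not tye's one-liner, and using a trimmed-down %words): temporarily mask each word's stopwords with placeholders, substitute, then unmask.

```perl
use strict;
use warnings;

my %words = (
    score => { replacement => 'twenty', stopwords => { fourscore => 1, scored => 1, scores => 1 } },
    core  => { replacement => 'center', stopwords => { encore => 1, coregent => 1 } },
);

my $text = 'encore score coregent core';

for my $word (sort { length $b <=> length $a } keys %words) {
    my $stop = join '|', map quotemeta, keys %{ $words{$word}{stopwords} };
    my @saved;
    # Mask any stopword occurrences so the substitution cannot touch them.
    $text =~ s/($stop)/push @saved, $1; "\0$#saved\0"/ge if $stop;
    # Substitute the bare word everywhere else.
    $text =~ s/\Q$word\E/$words{$word}{replacement}/g;
    # Restore the masked stopwords.
    $text =~ s/\0(\d+)\0/$saved[$1]/g;
}
print "$text\n";   # encore twenty coregent center
```

    This works without word boundaries, so it should carry over to text with no spaces between words.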
Re: Efficient selective substitution on list of words
by AnomalousMonk (Monsignor) on Jan 31, 2010 at 08:18 UTC

    Here's an approach based on regexes that might be of interest. It has some problems (addressed in Update), and it might be better to pursue the design of a parser for this application. I don't think it should be much of a problem to compile Unicode regexes, although I must admit I have no experience in this area.

    The problems I see (at the moment) are:

    1. Stop word prefixes and suffixes are analyzed entirely independently, so that core in the (nonsense) word encoregent (last line of test text below) is not translated even though the prefix en and suffix gent are not associated with each other in the stop word list. Update: In fact, encore stops translation of core. (Actually, now I think of it, this aspect of the regex could probably be fixed fairly easily, but at the cost of even slower speed. Maybe update later... See Update.)
    2. A word like cores (last line of test text) must be specified separately for translation, with its own stop word list. (Uncomment the specification for cores in the @trans array to see the effect.)
    3. A certain amount of confusion is introduced by the fact that the test text uses spaces to delimit words, and spaces, if I correctly understand the OP, would be essentially absent in the actual text. However, I couldn't get my mind around a total absence of whitespace!
    4. I wouldn't be surprised to find that this approach is as slow as molasses in January (hey...!) for translating hundreds of thousands or millions of words.
    5. If there is a very large number of words (and stop words) to translate, the final regex ($translate below) will be monstrous.
    6. Probably some more I (and others) will think of later...

    Anyhoo, FWIW, here it is.

    use warnings;
    use strict;

    use List::MoreUtils qw(uniq);

    my @trans = (
        # substitute... for...         except in...
        [ 'TWENTY',     'score',     qw(fourscore scored scores) ],
        [ 'CENTER',     'core',      qw(encore encores coregent score) ],
    #   [ 'CENTERS',    'cores',     qw(encores scores) ],
        [ 'JOHN',       'Johann',    qw(Johannesburg) ],
        [ 'CENTER',     'centre',    ],
        [ 'TRAVELED',   'travelled', ],
        [ 'HAS NOT',    'hasn\'t',   ],
    );

    my %trans = map @{ $_ }[1, 0], @trans;

    my ($translate) =
        map qr{ $_ }xms,
        join ' | ',
        map word_regex(@{ $_ }[1 .. $#{$_}]),
        @trans
        ;

    while (defined(my $line = <DATA>)) {
        print $line;
        $line =~ s{ ($translate) }{$trans{$1}}xmsg;
        print $line;
        print "\n";
    }

    sub word_regex {
        my ($word,
            @stops,
            ) = @_;

        my $not_prefix =                  # 5. conjunction
            join ') (?<! ',               # 4. neg. look-behind...
            uniq                          # 3. no dups...
            map m{ \A (.+) $word }xms,    # 2. extract prefixes...
            @stops                        # 1. for any stop words...
            ;
        $not_prefix = "(?<! $not_prefix)" if $not_prefix;  # final wrap

        my $not_suffix =
            join ' | ',
            uniq
            map m{ $word (.+) \z }xms,
            @stops
            ;
        $not_suffix = "(?! $not_suffix)" if $not_suffix;

        return qr{ $not_prefix $word $not_suffix }xms;
    }

    __DATA__
    the core of the coregents encores scored fourscore score of
    scores when Johann travelled to Johannesburg
    for a score of cores hasn't a centre encoregent

    Output:

    >perl selective_trans_1.pl
    the core of the coregents encores scored fourscore score of
    the CENTER of the coregents encores scored fourscore TWENTY of

    scores when Johann travelled to Johannesburg
    scores when JOHN TRAVELED to Johannesburg

    for a score of cores hasn't a centre encoregent
    for a TWENTY of cores HAS NOT a CENTER encoregent

    Update: A more-better version (replaces previous, less-better version). It addresses item 1 in the problem list above, and sorting by size as Polyglot suggests (apparently per tye) addresses item 2. It demonstrates the use of regexes in specifying the stop word list. (And expanding the function definitions inline would also produce the one-liner Polyglot wants. Win-win!) Although not what Polyglot wants or needs, I think this is a neat approach. I enjoyed working this problem, so thanks to Polyglot.

    use warnings;
    use strict;

    my @translate = (
        # insert...   for...         except in...
        [ 'TWENTY',   'score',     qw(twoscore unscored? score[srd]) ],
        [ 'CENTER',   'core',      qw(encore[sd]? score[sd]? core[srd]) ],
        [ 'CENTERS',  'cores',     qw(encores scores) ],
        [ 'JOHN',     'Johann',    qw(Johannesburg) ],
        [ 'CENTER',   'centre',    ],
        [ 'TRAVELED', 'travelled', ],
        [ 'HAS NOT',  'hasn\'t',   ],
    );

    my %replace = map @{ $_ }[1, 0], @translate;

    my $search =
        join ' | ',
        map word_regex(@{ $_ }[1 .. $#{$_}]),
        sort { $b->[1] cmp $a->[1] }    # longest words first
        @translate
        ;

    while (defined(my $line = <DATA>)) {
        print $line;
        $line =~ s{ ($search) }{$replace{$1}}xmsg;
        print $line;
        print "\n";
    }

    sub word_regex {
        my ($word,
            @stops,
            ) = @_;

        my $not_stopped =
            join ' ',
            map not_stopped(@$_),
            map [ m{ \A (.*) ($word) (.*) \z }xms ],
            @stops
            ;

        return "$not_stopped $word";
    }

    sub not_stopped {
        my ($stop_prefix,   # always defined if word defined, maybe empty
            $word,          # word embedded in stop word
            $stop_suffix,   # always defined if word defined, maybe empty
            ) = @_;

        return '' unless defined $word and length $word;

        # convert word to placeholder (faster match?)
        $word = sprintf '.{%d}', length $word;

        # convert stop prefix, if any, to POSITIVE assertion.
        $stop_prefix = "(?<= $stop_prefix)" if length $stop_prefix;

        # NEGATIVE assert of stop prefix, word placeholder, stop suffix.
        return "(?! $stop_prefix $word $stop_suffix)";
    }

    __DATA__
    a score and twoscore of unscored scorer won't unscore scored scores
    at the core of the encore that was scored and cored
    for many scores of encores Johann travelled to Johannesburg
    for a score of cores hasn't a stercorean centre

    Output:

    >perl selective_trans_1.pl
    a score and twoscore of unscored scorer won't unscore scored scores
    a TWENTY and twoscore of unscored scorer won't unscore scored scores

    at the core of the encore that was scored and cored
    at the CENTER of the encore that was scored and cored

    for many scores of encores Johann travelled to Johannesburg
    for many scores of encores JOHN TRAVELED to Johannesburg

    for a score of cores hasn't a stercorean centre
    for a TWENTY of CENTERS HAS NOT a sterCENTERan CENTER
      I really do appreciate your suggestions, and while yours may be more detailed overall, the charm of tye's solution was that he prepared the entire substitution list in a single line of code, using map, sort, and join with some interesting syntax. I wanted to learn how it worked for the beauty of learning as much as for the joy of having a working solution.

      In the sort process, I remember he ordered the words by length so that the substitutions would start with the longest words and work down to the shorter ones. This avoids several potential problems with overlapping terms in the exchange list and is, I think, almost a necessity for proper function here.
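      In the spirit of that description, the length-sorted search pattern might be prepared in a single statement like this (a sketch with made-up sample words, not tye's actual code):

```perl
use strict;
use warnings;

my %replace = ( score => 'twenty', scores => 'twenties', core => 'center' );

# One statement: escape each word, sort longest first, join into one alternation.
my $search = join '|',
             map  { quotemeta }
             sort { length($b) <=> length($a) }
             keys %replace;

my $text = 'scores and score';
$text =~ s/($search)/$replace{$1}/g;
print "$text\n";   # twenties and twenty
```

      Without the length sort, "score" could match inside "scores" first and leave a stray "s" behind.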

      Blessings,

      ~Polyglot~

Re: Efficient selective substitution on list of words
by graff (Chancellor) on Jan 31, 2010 at 17:00 UTC
    Deepest apologies for having skipped over the part of the OP that BrowserUK has considerately placed into focus for me.

    Now that I understand it correctly, I try again in a separate reply.

    Pardon me if I'm jumping to conclusions, but it seems like your notion of "stopwords" is really just a matter of making sure that the "word" string is not part of a larger word. If that's really all it amounts to, all you need is to put the \b assertion around each word:

    my %edits = (
        score     => 'twenty',
        core      => 'center',
        centre    => 'center',
        centres   => 'centers',
        travelled => 'traveled',
        "hasn't"  => 'has not',
        Johann    => 'John',
    );

    my $pattern = '\b(' . join( '|', keys %edits ) . ')\b';

    while (<DATA>) {
        s/$pattern/$edits{$1}/g;
        print;
    }

    __DATA__
    fourscore and score years ago, we scored great scores with apple cores.
    it's time for an encore at the core of our cultural centre.
    in many centres where we travelled, Johann hasn't scored as well as he
    did in Johannesburg, where his score against Johannes Brahms shook us
    to our cores.
    The example data there points out a couple of issues you may need to cope with using this approach:
    • spelling changes (e.g. "centre" to "center") will need to be specified for all inflected/derived forms ("centres", "centred", "centring") due to the use of the \b assertions
    • some replacements will be inappropriate due to ambiguous usage (e.g. "score" may be used in a context where it does not mean "twenty")
    • some replacements might produce awkward results (e.g. "core centre" becomes "center center") -- maybe that's a stretch, but it's relevant to the example that you provided.

    But depending on the actual set of replacements you need to do, those issues are likely to be less bothersome than the problem of trying to figure out all the "stopwords" you would need to specify in order to avoid incorrect replacements within larger words.

    In any case, the exercise as a whole really should be "previewed" or "monitored": for a given set of replacements and input data, get a listing of all the matches in the data, and/or review all changes applied by the process, to confirm that all changes are as intended. If you really are dealing with "natural language" data here, it pays to be really careful.

      The target language is Asian, where 1) there are no spaces between words;
Re: Efficient selective substitution on list of words
by graff (Chancellor) on Jan 31, 2010 at 19:27 UTC
    The part I left intact in my previous (misguided) reply is still applicable: you need to be very careful about checking results of the edits, and it's likely that some manual review (what NLP folks call "human annotation") of the output will be necessary in any case. Finding or building a good user interface for efficient review of automated edits will be time well spent.

    The target language is Asian, where 1) there are no spaces between words...

    There's a small but potentially devilish detail if the text data being edited comes with line-breaks within sentences/paragraphs. If that's true for your data, do you know for certain whether or not any of the multi-character strings to edit might get split by a line break? (For languages that don't put spaces between words, when explicit line-breaks are used, they can happen anywhere, including the middle of a "linguistic" word.)
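    If that turns out to be the case, one possible coping strategy is to join each paragraph into a single string before substituting, so a target word split across a line break is still seen as contiguous. A minimal sketch with made-up data:

```perl
use strict;
use warnings;

my @lines = ("ABCD\n", "EFGH\n");   # hypothetical input lines
my $paragraph = join '', @lines;
$paragraph =~ s/\n//g;              # drop the breaks before editing
$paragraph =~ s/DEFG/XY/;           # a "word" that straddled the original break
print "$paragraph\n";               # ABCXYH
```

    Whether and where the breaks should be reinserted afterwards depends on the data.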

    2) the encoding will be UTF-8.

    This is simply a matter of making sure to use the appropriate IO layer discipline when reading and writing files. So long as all file handles are opened/set to "utf8", the regex stuff will take care of itself (character semantics will be used).
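    Concretely, that means opening every handle through the :encoding(UTF-8) layer. A self-contained sketch (the file name is a placeholder) that demonstrates character semantics:

```perl
use strict;
use warnings;

# Write and read through the UTF-8 layer so the regex engine sees
# characters, not bytes.
my $file = 'utf8_demo.txt';   # placeholder name
open my $out, '>:encoding(UTF-8)', $file or die "$file: $!";
print {$out} "caf\x{E9}\n";   # 'café': 4 characters, 5 bytes in UTF-8
close $out;

open my $in, '<:encoding(UTF-8)', $file or die "$file: $!";
my $line = <$in>;
close $in;
unlink $file;

chomp $line;
print length($line), "\n";    # 4 (characters, not bytes)
```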

    The following approach doesn't deal with the possible issue of line-breaks in the data, so that's "left as an exercise" if it turns out to be an issue for you. I found that the "stopword" list for the dummy example core -> center needed to be "enhanced" so that it wouldn't misfire on tokens containing "score", and that sort of issue is something that will probably occupy some of your time.

    There's also a potential need to make sure that replacements are done in a specific order, e.g. if all "foo" must change to "bar", and all "baz" must change to "foo" (not to "bar"), you have to do the edits in that order. It's an easy thing to cope with, once you know enough about the data.
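    A sketch of that ordering point with placeholder words: run the edits from an ordered list rather than an unordered hash, so "foo" becomes "bar" before any "baz" becomes "foo".

```perl
use strict;
use warnings;

my $text = "foo baz";
for my $pair ( [ foo => 'bar' ], [ baz => 'foo' ] ) {
    my ($from, $to) = @$pair;
    $text =~ s/\Q$from\E/$to/g;
}
print "$text\n";   # bar foo
```

    Reversing the two passes would turn "foo baz" into "bar bar", which is the clobbering described above.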

    Finally, given the limited (and possibly misleading) nature of the sample data (text and edit directives), there's a decent chance that the following approach won't actually work for your application.

    That said, the following uses the stop-lists to form patterns that match enough characters around the target word so that you can check whether any of the stop-words match.

    #!/usr/bin/perl
    use strict;
    use Data::Dumper qw/Dumper/;

    my $text = <<EOT;
    fourscore and score years ago, we scored great scores in apple cores.
    it's time for an encore at the core of our cultural centre.
    in many centres where we travelled, Johann hasn't scored as well as he
    did in Johannesburg, where his score against Johannes Brahms shook us
    to our cores.
    EOT

    my %edit;
    while (<DATA>) {
        chomp;
        my ( $word, $repl, $stops ) = split /\t/;
        next unless ( length( $word ) and length( $repl ));
        $stops = '' unless defined $stops;   # rows may lack a stopword column
        my ( $pref_len, $suff_len ) = ( 0, 0 );
        my @stops = split( /,/, $stops );
        for my $stop ( @stops ) {
            my ( $pref, $suff ) = map { length( $_ ) } split( /\Q$word\E/, $stop );
            $pref_len = $pref if ( $pref_len < $pref );
            $suff_len = $suff if ( $suff_len < $suff );
        }
        my $pattern = sprintf( ".{0,%d}%s.{0,%d}", $pref_len, $word, $suff_len );
        $edit{$pattern} = {
            word => $word,
            repl => $repl,
            stop => join( '|', @stops ),
        };
    }

    for my $pattern ( keys %edit ) {
        while ( $text =~ /($pattern)/g ) {
            my $edited = my $source = $1;
            next if ( $edit{$pattern}{stop} and $edited =~ /(?:$edit{$pattern}{stop})/ );
            $edited =~ s/$edit{$pattern}{word}/$edit{$pattern}{repl}/;
            $text =~ s/\Q$source\E/$edited/;
        }
    }
    print $text;

    __DATA__
    score	twenty	fourscore,scored,scores
    core	center	encore,coregent,score
    centre	center
    travelled	traveled
    hasn't	has not
    Johann	John	Johannesburg
    If any of your actual stop-word patterns happen to contain "regex-magic" characters, like ".?", they will be applied as such -- i.e. "a.?b" will match "ab" or "a.b" (any character in the middle), but will not work to match a literal period and question-mark surrounded by "a" and "b". I'm sure there's a way to enforce literal matches, but it might be tricky.
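    Enforcing literal matches is actually straightforward: quotemeta (or \Q...\E inside the pattern) escapes every regex metacharacter. A small sketch:

```perl
use strict;
use warnings;

my $stop    = 'a.?b';            # a stop-word string with regex-magic characters
my $literal = quotemeta $stop;   # becomes 'a\.\?b'

print( ("ab"   =~ /$literal/ ? 1 : 0), "\n" );   # 0 - 'ab' no longer matches
print( ("a.?b" =~ /$literal/ ? 1 : 0), "\n" );   # 1 - only the literal string matches
```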

    (P.S.: When I pasted the source code into the posting text-box, I did try to make sure there were literal tabs in the DATA lines -- I hope it comes through that way on download.)
