
Re: Efficient selective substitution on list of words

by graff (Chancellor)
on Jan 31, 2010 at 17:00 UTC (#820594)

in reply to Efficient selective substitution on list of words

Deepest apologies for having skipped over the part of the OP that BrowserUK has considerately placed into focus for me.

Now that I understand it correctly, I try again in a separate reply.

Pardon me if I'm jumping to conclusions, but it seems like your notion of "stopwords" is really just a matter of making sure that the "word" string is not part of a larger word. If that's really all it amounts to, all you need is to put a \b assertion on each side of the alternation of words:

use strict;
use warnings;

my %edits = (
    score     => 'twenty',
    core      => 'center',
    centre    => 'center',
    centres   => 'centers',
    travelled => 'traveled',
    "hasn't"  => 'has not',
    Johann    => 'John',
);

# One alternation of all the words, anchored by \b on both sides.
# quotemeta keeps any regex metacharacters in the keys literal.
my $pattern = '\b(' . join( '|', map { quotemeta } keys %edits ) . ')\b';

while (<DATA>) {
    s/$pattern/$edits{$1}/g;
    print;
}

__DATA__
fourscore and score years ago, we scored great scores with apple cores.
it's time for an encore at the core of our cultural centre.
in many centres where we travelled, Johann hasn't scored as well as he did
in Johannesburg, where his score against Johannes Brahms shook us to our cores.

The example data there points out a couple of issues you may need to cope with using this approach:
  • spelling changes (e.g. "centre" to "center") will need to be specified for all inflected/derived forms ("centres", "centred", "centring") due to the use of the \b assertions
  • some replacements will be inappropriate due to ambiguous usage (e.g. "score" may be used in a context where it does not mean "twenty")
  • some replacements might produce awkward results (e.g. "core centre" becomes "center center") -- maybe that's a stretch, but it's relevant to the example that you provided.
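
To make the first point concrete, here is a minimal sketch. The two-entry table and the `apply_edits` helper are made up for the illustration, not part of the script above. With \b on both sides, only the forms listed in the table are touched, so "centred" is left alone until you add it yourself:

```perl
use strict;
use warnings;

# Hypothetical subset of the edit table, for illustration only.
my %edits = ( centre => 'center', centres => 'centers' );

my $alternation = join '|', map { quotemeta } sort keys %edits;
my $pattern     = qr/\b($alternation)\b/;

# Apply every edit in the table to a copy of the text and return it.
sub apply_edits {
    my ($text) = @_;
    $text =~ s/$pattern/$edits{$1}/g;
    return $text;
}

# "centre" and "centres" are in the table; "centred" is not.
print apply_edits("the centre, two centres, a centred title"), "\n";
```

The unlisted form passes through unchanged, which is safe but means each derived spelling needs its own entry.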

But depending on the actual set of replacements you need to do, those issues are likely to be less bothersome than the problem of trying to figure out all the "stopwords" you would need to specify in order to avoid incorrect replacements within larger words.

In any case, the exercise as a whole really should be "previewed" or "monitored": for a given set of replacements and input data, get a listing of all the matches in the data, and/or review all changes applied by the process, to confirm that all changes are as intended. If you really are dealing with "natural language" data here, it pays to be really careful.
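
One way to do that kind of preview is sketched below, under the assumption that a simple per-line listing is enough. The `preview_matches` helper and the shortened edit table are invented for the example; it reports every match with its line number and proposed replacement, and changes nothing:

```perl
use strict;
use warnings;

# Shortened, illustrative version of the edit table.
my %edits = ( score => 'twenty', core => 'center', centre => 'center' );

my $alternation = join '|', map { quotemeta } sort keys %edits;
my $pattern     = qr/\b($alternation)\b/;

# Return one hashref per match: line number, matched word, and the
# replacement that would be applied. The input lines are not modified.
sub preview_matches {
    my ( $pattern, $edits, @lines ) = @_;
    my @report;
    my $line_no = 0;
    for my $line (@lines) {
        $line_no++;
        while ( $line =~ /$pattern/g ) {
            push @report,
              { line => $line_no, word => $1, replacement => $edits->{$1} };
        }
    }
    return @report;
}

my @lines = (
    "fourscore and score years ago, we scored great scores with apple cores.\n",
    "it's time for an encore at the core of our cultural centre.\n",
);

for my $hit ( preview_matches( $pattern, \%edits, @lines ) ) {
    printf "line %d: %s -> %s\n", @{$hit}{qw(line word replacement)};
}
```

Reading through a listing like this before running the real substitution is where you would catch a "score" that does not mean "twenty".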

Replies are listed 'Best First'.
Re^2: Efficient selective substitution on list of words
by BrowserUk (Pope) on Jan 31, 2010 at 17:15 UTC
    The target language is Asian, where 1) there are no spaces between words;
