Re: Efficient selective substitution on list of wordsby graff (Chancellor)
|on Jan 31, 2010 at 19:27 UTC||Need Help??|
The part I left intact in my previous (misguided) reply is still applicable: you need to be very careful about checking results of the edits, and it's likely that some manual review (what NLP folks call "human annotation") of the output will be necessary in any case. Finding or building a good user interface for efficient review of automated edits will be time well spent.
The target language is Asian, where 1) there are no spaces between words...
There's a small but potentially devilish detail if the text data being edited comes with line-breaks within sentences/paragraphs. If that's true for your data, do you know for certain whether or not any of the multi-character strings to edit might get split by a line break? (For languages that don't put spaces between words, when explicit line-breaks are used, they can happen anywhere, including the middle of a "linguistic" word.)
2) the encoding will be UTF-8.
This is simply a matter of making sure to use the appropriate IO layer discipline when reading and writing files. So long as all file handles are opened/set to "utf8", the regex stuff will take care of itself (character semantics will be used).
The following approach doesn't deal with the possible issue of line-breaks in the data, so that's "left as an exercise" if it turns out to be an issue for you. I found that the "stopword" list for the dummy example core -> center needed to be "enhanced" so that it wouldn't misfire on tokens containing "score", and that sort of issue is something that will probably occupy some of your time.
There's also a potential need to make sure that replacements are done in a specific order, e.g. if all "foo" must change to "bar", and all "baz" must change to "foo" (not to "bar"), you have to do the edits in that order. It's an easy thing to cope with, once you know enough about the data.
Finally, given the limited (and possibly misleading) nature of the sample data (text and edit directives), there's a decent chance that the following approach won't actually work for your application.
That said, the following uses the stop-lists to form patterns that match enough characters around the target word so that you can check whether any of the stop-words match.
If any of your actual stop-word patterns happen to contain "regex-magic" characters, like ".?", they will be applied as such -- i.e. "a.?b" will match "ab" or "a.b" (any character in the middle), but will not work to match a literal period and question-mark surrounded by "a" and "b". I'm sure there's a way to enforce literal matches, but it might be tricky.
(P.S.: When I pasted the source code into the posting text-box, I did try to make sure there were literal tabs in the DATA lines -- I hope it comes through that way on download.)