Re: Efficient selective substitution on list of words

by graff (Chancellor)
on Jan 31, 2010 at 19:27 UTC (#820616)

in reply to Efficient selective substitution on list of words

The part I left intact in my previous (misguided) reply is still applicable: you need to be very careful about checking results of the edits, and it's likely that some manual review (what NLP folks call "human annotation") of the output will be necessary in any case. Finding or building a good user interface for efficient review of automated edits will be time well spent.

The target language is Asian, where 1) there are no spaces between words...

There's a small but potentially devilish detail if the text data being edited comes with line-breaks within sentences/paragraphs. If that's true for your data, do you know for certain whether or not any of the multi-character strings to edit might get split by a line break? (For languages that don't put spaces between words, when explicit line-breaks are used, they can happen anywhere, including the middle of a "linguistic" word.)
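One way to sidestep the line-break problem is to "unwrap" the text before editing. This is only a sketch under assumptions that may not hold for your data: paragraphs are separated by blank lines, and a single line break inside a paragraph carries no meaning. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample: the two-character word \x{6771}\x{4eac} is split
# across a line break; a blank line separates the two paragraphs.
my $wrapped = "\x{6771}\n\x{4eac}\x{90fd}\n\n\x{5927}\x{962a}\n";

# Remove a newline only when it has a non-newline character on both sides,
# so blank-line paragraph breaks and the final newline are left alone.
( my $unwrapped = $wrapped ) =~ s/(?<=[^\n])\n(?=[^\n])//g;

my $target = "\x{6771}\x{4eac}";
printf "before unwrapping: %s\n", ( $wrapped   =~ /\Q$target\E/ ? "match" : "no match" );
printf "after unwrapping:  %s\n", ( $unwrapped =~ /\Q$target\E/ ? "match" : "no match" );
```

If the original line breaks need to survive in the output, this is not enough by itself; you would have to record where the breaks were and put them back after editing.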

2) the encoding will be UTF-8.

This is simply a matter of using the appropriate IO layer when reading and writing files. So long as every file handle is opened with (or set via binmode to) a UTF-8 layer such as ":encoding(UTF-8)", the regex stuff will take care of itself (character semantics will be used).
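For illustration, here is a minimal round trip through a UTF-8 layer. The temp file and the sample characters are assumptions made up for the demo; the point is only that length() and "." count characters rather than bytes once the layer is in place.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Throwaway file for the demo; it is removed when the program exits.
my ( $tmp_fh, $tmp_name ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Write four CJK characters through an encoding layer.
open( my $out, '>:encoding(UTF-8)', $tmp_name ) or die "$tmp_name: $!";
print {$out} "\x{4e2d}\x{6587}\x{6587}\x{672c}\n";
close $out;

# Read them back through the same kind of layer.
open( my $in, '<:encoding(UTF-8)', $tmp_name ) or die "$tmp_name: $!";
my $line = <$in>;
close $in;
chomp $line;

my $char_count  = length $line;         # counts characters, not bytes
my ($first_two) = $line =~ /^(..)/;     # "." matches one whole character

binmode( STDOUT, ':encoding(UTF-8)' );
print "$char_count characters; first two: $first_two\n";
```

Without the layer on the input handle, the same string would come back as 12 bytes and ".." would grab two bytes out of the first character.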

The following approach doesn't deal with the possible issue of line-breaks in the data, so that's "left as an exercise" if it turns out to be an issue for you. I found that the "stopword" list for the dummy example core -> center needed to be "enhanced" so that it wouldn't misfire on tokens containing "score", and that sort of issue is something that will probably occupy some of your time.

There's also a potential need to make sure that replacements are done in a specific order, e.g. if all "foo" must change to "bar", and all "baz" must change to "foo" (not to "bar"), you have to do the edits in that order. It's an easy thing to cope with, once you know enough about the data.
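A small sketch of the ordering point, using the foo/bar/baz example above. The apply_edits sub and the array-of-pairs layout are hypothetical names of my own; the idea is just that an array keeps the order you give it, whereas "keys" on a hash returns entries in no guaranteed order.

```perl
use strict;
use warnings;

# Apply [from, to] pairs strictly in the order given.
sub apply_edits {
    my ( $text, @edits ) = @_;
    for my $edit (@edits) {
        my ( $from, $to ) = @$edit;
        $text =~ s/\Q$from\E/$to/g;
    }
    return $text;
}

my $source = "foo baz foo";

# Intended order: foo -> bar first, then baz -> foo.
my $right = apply_edits( $source, [ foo => 'bar' ], [ baz => 'foo' ] );

# Reversed order: the new "foo" (from baz) is swept up by foo -> bar.
my $wrong = apply_edits( $source, [ baz => 'foo' ], [ foo => 'bar' ] );

print "intended order: $right\n";
print "reversed order: $wrong\n";
```

In the script further down, the edits live in a hash and are walked with "keys %edit", so if order matters for your data you would want to keep a separate ordered list of patterns.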

Finally, given the limited (and possibly misleading) nature of the sample data (text and edit directives), there's a decent chance that the following approach won't actually work for your application.

That said, the following uses the stop-lists to form patterns that match enough characters around the target word so that you can check whether any of the stop-words match.

#!/usr/bin/perl
use strict;
use Data::Dumper qw/Dumper/;

my $text = <<EOT;
fourscore and score years ago, we scored great scores in apple cores.
it's time for an encore at the core of our cultural centre.
in many centres where we travelled, Johann hasn't scored as well as
he did in Johannesburg, where his score against Johannes Brahms shook
us to our cores.
EOT

my %edit;
while (<DATA>) {
    chomp;
    my ( $word, $repl, $stops ) = split /\t/;
    next unless ( length( $word ) and length( $repl ));

    # find the longest prefix and suffix that any stop-word adds around $word
    my ( $pref_len, $suff_len ) = ( 0, 0 );
    my @stops = split( /,/, $stops );
    for my $stop ( @stops ) {
        my ( $pref, $suff ) = map { length( $_ ) } split( /\Q$word\E/, $stop );
        $pref_len = $pref if ( $pref_len < $pref );
        $suff_len = $suff if ( $suff_len < $suff );
    }

    # match the word plus enough context to recognize any stop-word
    my $pattern = sprintf( ".{0,%d}%s.{0,%d}", $pref_len, $word, $suff_len );
    $edit{$pattern} = { word => $word,
                        repl => $repl,
                        stop => join( '|', @stops ) };
}

for my $pattern ( keys %edit ) {
    while ( $text =~ /($pattern)/g ) {
        my $edited = my $source = $1;
        # skip this match if its context contains a stop-word
        next if ( $edit{$pattern}{stop} and $edited =~ /(?:$edit{$pattern}{stop})/ );
        $edited =~ s/$edit{$pattern}{word}/$edit{$pattern}{repl}/;
        $text =~ s/\Q$source\E/$edited/;
    }
}

print $text;

__DATA__
score	twenty	fourscore,scored,scores
core	center	encore,coregent,score
centre	center
travelled	traveled
hasn't	has not
Johann	John	Johannesburg

If any of your actual stop-word patterns happen to contain "regex-magic" characters, like ".?", they will be applied as such -- i.e. "a.?b" will match "ab" or "a.b" (any character in the middle), but will not work to match a literal period and question-mark surrounded by "a" and "b". I'm sure there's a way to enforce literal matches, but it might be tricky.
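For what it's worth, one candidate for enforcing literal matches is Perl's built-in quotemeta (the function behind \Q...\E), applied to each stop-word before joining them into an alternation. The stop list below is made up for the demo, and I haven't tried this against real data, so treat it as a sketch.

```perl
use strict;
use warnings;

my @stops = ( 'a.?b', 'score' );    # hypothetical stop-words

# Joined as-is, ".?" keeps its regex meaning.
my $magic = join '|', @stops;

# Escaped first, every character matches only itself.
my $literal = join '|', map { quotemeta } @stops;

for my $sample ( 'ab', 'a.b', 'a.?b' ) {
    printf "%-5s magic: %-3s literal: %s\n",
        $sample,
        ( $sample =~ /(?:$magic)/   ? 'yes' : 'no' ),
        ( $sample =~ /(?:$literal)/ ? 'yes' : 'no' );
}
```

The same escaping would probably be wanted for the target word where it gets interpolated into the pattern, since the script only quotes it inside the split.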

(P.S.: When I pasted the source code into the posting text-box, I did try to make sure there were literal tabs in the DATA lines -- I hope it comes through that way on download.)
