The part I left intact in my previous (misguided) reply is still applicable: you need to be very careful about checking results of the edits, and it's likely that some manual review (what NLP folks call "human annotation") of the output will be necessary in any case. Finding or building a good user interface for efficient review of automated edits will be time well spent.

The target language is Asian, where 1) there are no spaces between words...

There's a small but potentially devilish detail if the text data being edited comes with line-breaks within sentences/paragraphs. If that's true for your data, do you know for certain whether or not any of the multi-character strings to edit might get split by a line break? (For languages that don't put spaces between words, when explicit line-breaks are used, they can happen anywhere, including the middle of a "linguistic" word.)
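If line breaks do turn out to split target strings, one possible workaround is to build each pattern so that it tolerates a newline between any two characters. The following is only a sketch under that assumption: the helper name `breakable_pattern`, the two-character target and the `[EDIT]` replacement are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: build a regex for a literal target string that
# tolerates one line break between any two of its characters.
sub breakable_pattern {
    my ($word) = @_;
    my $re = join '\n?', map { quotemeta } split //, $word;
    return qr/$re/;
}

my $word = "\x{6771}\x{4EAC}";                        # made-up two-character target
my $text = "AB\x{6771}\n\x{4EAC}CD\x{6771}\x{4EAC}";  # first occurrence is split

my $pattern = breakable_pattern($word);
my $count   = () = $text =~ /$pattern/g;              # counts both occurrences

( my $edited = $text ) =~ s/$pattern/[EDIT]/g;        # the line break is consumed
print "matches: $count\n";
```

Note that the substitution swallows the line break along with the matched word, so if the line structure matters you would have to decide where to put the break back.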

2) the encoding will be UTF-8.

This is simply a matter of using the appropriate I/O layer when reading and writing files. So long as every file handle is opened with, or set via binmode to, a UTF-8 layer (":utf8" or the stricter ":encoding(UTF-8)"), the regex stuff will take care of itself (character semantics will be used).
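To make the difference concrete, here is a minimal sketch that reads the same six bytes once without and once with an encoding layer. It uses an in-memory file handle so that it is self-contained; with real files you would put the file name where the scalar reference is. The sample bytes are just two arbitrary CJK characters.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for two CJK characters (U+6771 U+4EAC), 3 bytes each.
my $bytes = "\xE6\x9D\xB1\xE4\xBA\xAC";

# Without a layer the data arrives as six separate bytes.
open( my $raw, '<', \$bytes ) or die "open: $!";
my $as_bytes = <$raw>;
close $raw;

# With the layer the same data arrives as two characters.
open( my $in, '<:encoding(UTF-8)', \$bytes ) or die "open: $!";
my $as_chars = <$in>;
close $in;

# Under character semantics, "." matches one whole character.
my ($first) = $as_chars =~ /^(.)/;

# Encode on the way out too, so that output is valid UTF-8 again.
binmode( STDOUT, ':encoding(UTF-8)' );
printf "bytes: %d, characters: %d\n", length($as_bytes), length($as_chars);
```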

The following approach doesn't deal with the possible issue of line-breaks in the data, so that's "left as an exercise" if it turns out to be an issue for you. I found that the "stopword" list for the dummy example core -> center needed to be "enhanced" so that it wouldn't misfire on tokens containing "score", and that sort of issue is something that will probably occupy some of your time.

There's also a potential need to make sure that replacements are done in a specific order, e.g. if all "foo" must change to "bar", and all "baz" must change to "foo" (not to "bar"), you have to do the edits in that order. It's an easy thing to cope with, once you know enough about the data.
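A small sketch of the ordering issue, using the foo/bar/baz example from the paragraph above. The first half keeps the edits in an array, because a Perl hash does not preserve insertion order. The second half is a different technique that I am swapping in as an alternative: a single pass over an alternation, so that no replacement can be re-edited by a later rule. Variable names are illustrative only.

```perl
use strict;
use warnings;

# Ordered edit list: an array keeps the order, a hash would not.
my @edits = (
    [ foo => 'bar' ],    # must run first
    [ baz => 'foo' ],    # runs second, so the new "foo" is left alone
);

my $text = "foo baz foo";
for my $edit (@edits) {
    my ( $from, $to ) = @$edit;
    $text =~ s/\Q$from\E/$to/g;
}
print "$text\n";

# Alternative: one pass over an alternation, longest keys first.
# Each stretch of text is rewritten at most once, so order stops mattering.
my %map = ( foo => 'bar', baz => 'foo' );
my $alt = join '|', map { quotemeta }
          sort { length($b) <=> length($a) } keys %map;
( my $one_pass = "foo baz foo" ) =~ s/($alt)/$map{$1}/g;
print "$one_pass\n";
```

Both versions turn "foo baz foo" into "bar foo bar". Running the two edits in the opposite order would give "bar bar bar" instead.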

Finally, given the limited (and possibly misleading) nature of the sample data (text and edit directives), there's a decent chance that the following approach won't actually work for your application.

That said, the following uses the stop-lists to form patterns that match enough characters around the target word so that you can check whether any of the stop-words match.

#!/usr/bin/perl
use strict;
use Data::Dumper qw/Dumper/;

my $text = <<EOT;
fourscore and score years ago, we scored great scores in apple cores.
it's time for an encore at the core of our cultural centre.
in many centres where we travelled, Johann hasn't scored as well as he did
in Johannesburg, where his score against Johannes Brahms shook us to our cores.
EOT

# Read the edit directives: word, replacement, comma-separated stop-words.
my %edit;
while (<DATA>) {
    chomp;
    my ( $word, $repl, $stops ) = split /\t/;
    next unless ( length( $word ) and length( $repl ));
    # Find the longest prefix and suffix that any stop-word adds to the word.
    my ( $pref_len, $suff_len ) = ( 0, 0 );
    my @stops = split( /,/, $stops );
    for my $stop ( @stops ) {
        my ( $pref, $suff ) = map { length( $_ ) } split( /\Q$word\E/, $stop );
        $pref_len = $pref if ( $pref_len < $pref );
        $suff_len = $suff if ( $suff_len < $suff );
    }
    # Match the word plus enough context to recognize any stop-word.
    my $pattern = sprintf( ".{0,%d}%s.{0,%d}", $pref_len, $word, $suff_len );
    $edit{$pattern} = { word => $word,
                        repl => $repl,
                        stop => join( '|', @stops ) };
}

for my $pattern ( keys %edit ) {
    while ( $text =~ /($pattern)/g ) {
        my $edited = my $source = $1;
        # Skip this match if the surrounding context contains a stop-word.
        next if ( $edit{$pattern}{stop} and $edited =~ /(?:$edit{$pattern}{stop})/ );
        $edited =~ s/$edit{$pattern}{word}/$edit{$pattern}{repl}/;
        $text =~ s/\Q$source\E/$edited/;
    }
}
print $text;

# The DATA fields below are separated by literal tabs.
__DATA__
score	twenty	fourscore,scored,scores
core	center	encore,coregent,score
centre	center
travelled	traveled
hasn't	has not
Johann	John	Johannesburg
If any of your actual stop-word patterns happen to contain "regex-magic" characters, like ".?", they will be applied as such -- i.e. "a.?b" will match "ab" or "a.b" (any character in the middle), but will not work to match a literal period and question-mark surrounded by "a" and "b". I'm sure there's a way to enforce literal matches, but it might be tricky.
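For what it's worth, one way to get literal matching is to run each stop-word through quotemeta before joining them into the alternation. The sketch below compares the two behaviours; it assumes that every stop-word is meant literally, and the stop-word list is invented for the demonstration.

```perl
use strict;
use warnings;

# Stop-words as they might appear in the edit list; "a.?b" is meant literally.
my @stops = ( 'a.?b', 'encore' );

# Joined as in the script above: metacharacters stay active.
my $magic = join '|', @stops;

# Escaped first: every stop-word matches only itself.
my $literal = join '|', map { quotemeta } @stops;

for my $token ( 'ab', 'a.?b' ) {
    printf "%-5s magic=%d literal=%d\n", $token,
        ( $token =~ /(?:$magic)/   ? 1 : 0 ),
        ( $token =~ /(?:$literal)/ ? 1 : 0 );
}
```

The unescaped pattern matches "ab" but not the literal string "a.?b", while the escaped one does the reverse. The same treatment (or \Q...\E inside the regex) would apply to the target word itself if it can contain such characters.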

(P.S.: When I pasted the source code into the posting text-box, I did try to make sure there were literal tabs in the DATA lines -- I hope it comes through that way on download.)

In reply to Re: Efficient selective substitution on list of words by graff
in thread Efficient selective substitution on list of words by Polyglot
