Deepest apologies for having skipped over the part of the OP that BrowserUK has considerately placed into focus for me.

Now that I understand it correctly, I try again in a separate reply.

Pardon me if I'm jumping to conclusions, but it seems like your notion of "stopwords" is really just a matter of making sure that the "word" string is not part of a larger word. If that's really all it amounts to, all you need is to put the \b assertion on both sides of the alternation, so that each word only matches as a whole word:

    my %edits = (
        score     => 'twenty',
        core      => 'center',
        centre    => 'center',
        centres   => 'centers',
        travelled => 'traveled',
        "hasn't"  => 'has not',
        Johann    => 'John'
    );
    my $pattern = '\b(' . join( '|', keys %edits ) . ')\b';
    while (<DATA>) {
        s/$pattern/$edits{$1}/g;
        print;
    }
    __DATA__
    fourscore and score years ago, we scored great scores with apple cores.
    it's time for an encore at the core of our cultural centre.
    in many centres where we travelled, Johann hasn't scored as well as he
    did in Johannesburg, where his score against Johannes Brahms shook us to our cores.
The example data there points out a few issues you may need to cope with when using this approach:
  • spelling changes (e.g. "centre" to "center") will need to be specified for all inflected/derived forms ("centres", "centred", "centring") due to the use of the \b assertions
  • some replacements will be inappropriate due to ambiguous usage (e.g. "score" may be used in a context where it does not mean "twenty")
  • some replacements might produce awkward results (e.g. "core centre" becomes "center center") -- maybe that's a stretch, but it's relevant to the example that you provided.
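For the first of those issues, one way to avoid typing out every inflected form by hand is to generate the entries from a stem plus a list of endings. This is only a sketch: the %stems table, its layout, and the apply_edits name are my own invention for illustration, and the suffix list would need checking against your real data.

```perl
use strict;
use warnings;

# Hypothetical stem table (not from the original post): each old stem maps
# to a new stem, and each old ending maps to the ending that replaces it.
my %stems = (
    centr => {
        to       => 'center',
        suffixes => { e => '', es => 's', ed => 'ed', ing => 'ing' },
    },
);

# Expand the stems into a flat word-to-word table like %edits above.
my %edits;
for my $stem ( keys %stems ) {
    my $entry = $stems{$stem};
    for my $suffix ( keys %{ $entry->{suffixes} } ) {
        $edits{ $stem . $suffix } = $entry->{to} . $entry->{suffixes}{$suffix};
    }
}

# quotemeta protects against keys containing regex metacharacters; sorting
# longest first is not strictly needed with \b on both sides, but it keeps
# the alternation predictable.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %edits;
my $pattern = qr/\b($alternation)\b/;

# Return a copy of the text with all whole-word matches replaced.
sub apply_edits {
    my ($text) = @_;
    $text =~ s/$pattern/$edits{$1}/g;
    return $text;
}

print apply_edits("the centre, two centres, a centred title, and centring it"), "\n";
```

Because the \b assertions are still in place, a word such as "centrifuge" is left alone; only the forms generated from the table are touched.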

But depending on the actual set of replacements you need to do, those issues are likely to be less bothersome than the problem of trying to figure out all the "stopwords" you would need to specify in order to avoid incorrect replacements within larger words.

In any case, the exercise as a whole really should be "previewed" or "monitored": for a given set of replacements and input data, get a listing of all the matches in the data, and/or review all changes applied by the process, to confirm that all changes are as intended. If you really are dealing with "natural language" data here, it pays to be really careful.
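A preview pass along those lines could look something like the following. Treat it as a rough sketch: preview() is a hypothetical helper name, the 15-character context window is an arbitrary choice, and the table is a shortened copy of the one in the example.

```perl
use strict;
use warnings;

# Shortened copy of the replacement table from the example.
my %edits = (
    score  => 'twenty',
    core   => 'center',
    centre => 'center',
);

# quotemeta guards against keys that contain regex metacharacters.
my $alternation = join '|', map { quotemeta($_) } keys %edits;
my $pattern     = qr/\b($alternation)\b/;

# preview() is a hypothetical helper: it returns one report line per match
# and does not modify the input text.
sub preview {
    my ( $line_no, $text ) = @_;
    my @report;
    while ( $text =~ /$pattern/g ) {
        my $word  = $1;
        my $start = $-[1];    # offset where the captured word begins
        # Show up to 15 characters of context on either side of the match.
        my $from    = $start > 15 ? $start - 15 : 0;
        my $context = substr( $text, $from, $start - $from + length($word) + 15 );
        push @report, sprintf "line %d: '%s' -> '%s' in [%s]",
            $line_no, $word, $edits{$word}, $context;
    }
    return @report;
}

my @lines = (
    "fourscore and score years ago, we scored great scores with apple cores.",
    "it's time for an encore at the core of our cultural centre.",
);
my $n = 0;
print "$_\n" for map { preview( ++$n, $_ ) } @lines;
```

Reading through a report like this before running the real substitution makes it easy to spot cases such as "score" used in a sense other than "twenty".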

In reply to Re: Efficient selective substitution on list of words by graff
in thread Efficient selective substitution on list of words by Polyglot
