Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Re: How to count substitutions on an array

by AnomalousMonk (Archbishop)
on Aug 13, 2016 at 18:30 UTC ( [id://1169719]=note: print w/replies, xml ) Need Help??


in reply to How to count substitutions on an array

If your substitutions are all literal substitutions, i.e., all of the form
    'banana' => 'plum'
    'xyzzy'  => 'whatever'
and never of the form
    /f[eio]e+/ => 'something'
and definitely not the sort of thing BillKSmith is doing here, then the following approach might be helpful. If you have many substitutions to make, the alternation of the  $search regex may grow quite large, but with the alternation trie optimization of Perl version 5.10, this should not be a problem — unless it gets really big! (My guess is that 10K search-replace pairs could be handled.)

c:\@Work\Perl>perl -le "use 5.010; ;; use warnings; use strict; ;; my %direct_substitution = ( 'apple' => 'PEAR', 'red' => 'YELLOW', 'xyzzy' => 'SOME OTHER THING', ); ;; my ($search) = map qr{ \b (?: $_ ) \b }xms, join q{ | }, keys %direct_substitution ; print $search; ;; my @lines = ( 'apple xapple apple applex xapplex apple', 'red', 'xxx red apple yyyy xyzzy zz', ); ;; my $count = 0; $count += s{ ($search) }{$direct_substitution{$1}}xmsg for @lines; ;; print qq{substitutions: $count}; print qq{'$_'} for @lines; " (?^msx: \b (?: apple | red | xyzzy ) \b ) substitutions: 7 'PEAR xapple PEAR applex xapplex PEAR' 'YELLOW' 'xxx YELLOW PEAR yyyy SOME OTHER THING zz'
Of course, you would go through your  @substitutionlist array to build the  %direct_substitution hash before processing the  @lines array.


Give a man a fish:  <%-{-{-{-<

Replies are listed 'Best First'.
Re^2: How to count substitutions on an array
by Anonymous Monk on Aug 13, 2016 at 19:35 UTC

    AnomalousMonk,

    I appreciate the time you have put into forming such detailed responses. I must admit that I do not fully understand this one, but may come back to review it later to see what I can learn from it. However, in my case, I don't think the substitution list itself can be better arranged.

    My situation has, at present, no regex expressions, only pure textual substitutions; however, it has substitutions of varying lengths requiring the longer ones to come first. The longest ones will be several sentences in length.

    For example, I might wish to substitute "every one" with "everyone" AFTER I have already substituted "every one of them" with "every one of them" (no change, but the substitution itself will get padded so that it will not match subsequently, thus preserving it from being incorrectly changed to "everyone" in that case). Another case would be to change "sore athirst" to "very thirsty" BEFORE changing all "athirst" instances to "thirsty." In my case, I am ordering the longest substitutions to take place first. I don't think Regex::Assemble would properly handle this. Efficiency must take second priority, though it is important because ongoing edits will require many executions of the script. I am needing to count each substitution so that I can checksum with the original files to ascertain the correct substitutions have indeed taken place.

      You say you are quite satisfied with your current solution, but perhaps this may be of interest for future reference.

      ... no regex expressions, only pure textual substitutions ... substitutions of varying lengths requiring the longer ones to come first. ... no change, but the substitution itself will get padded so that it will not match subsequently ...

      This raises a point I had overlooked before. It's possible to add longest-first discrimination when building an alternation. I've also made an attempt to add some acceptance of variable whitespace to the solution. There's also a feature to skip over certain phrases. This avoids the substitution of a substring with itself just to step over it in a possibly expensive no-op. (The weird capitalization is just to emphasize the substituted bits.)

      c:\@Work\Perl>perl -le "use 5.010; ;; use warnings; use strict; ;; my @skip_over = ( 'every one of them', 'all in good time', ); ;; my ($skip) = map qr{ \b (?: $_ ) \b (*SKIP) (*FAIL) }xms, join q{ | }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } @skip_over ; print qq{\$skip: $skip \n}; ;; my %direct_substitution = ( 'every one' => 'EVERYONE', 'sore athirst' => 'Very Thirsty', 'athirst' => 'THIRSTY', 'all in' => 'GONZO', ); ;; my ($capture) = map qr{ \b (?: $_ ) \b }xms, join q{ | }, map qr{ \Q$_\E }xms, sort { length($b) <=> length($a) } keys %direct_substitution ; print qq{\$capture: $capture \n}; ;; my $line = qq{every one wang chung every one \n} . qq{of them are sore athirst if not well athirst. \n} . qq{all in good time we will enjoy all in wrestling. \n} ; print qq{[[$line]] \n}; ;; $line =~ s{ \s+ }' 'xmsg; print qq{(($line)) \n}; ;; my $count = 0; $count += $line =~ s{ ($skip | $capture) } {$direct_substitution{$1}}xmsg; ;; print qq{substitutions: $count}; print qq{<<$line>> \n}; " $skip: (?^msx: \b (?: (?^msx: every\ one\ of\ them ) | (?^msx: all\ in +\ good\ time ) ) \b (*SKIP) (* FAIL) ) $capture: (?^msx: \b (?: (?^msx: sore\ athirst ) | (?^msx: every\ one +) | (?^msx: athirst ) | (?^msx : all\ in ) ) \b ) [[every one wang chung every one of them are sore athirst if not well athirst. all in good time we will enjoy all in wrestling. ]] ((every one wang chung every one of them are sore athirst if not well +athirst. all in good time we will enjoy all in wrestling. )) substitutions: 4 <<EVERYONE wang chung every one of them are Very Thirsty if not well T +HIRSTY. all in good time we will enjoy GONZO wrestling. >>
      (Some long output lines have been arbitrarily wrapped when composing this post.) Of course, I intend each  $line to be an element in an array over which you're looping.

      Update: Of course, it's possible to get rid of the
          $line =~ s{ \s+ }' 'xmsg;
      whitespace collapsing step and make the code even more whitespace agnostic.


      Give a man a fish:  <%-{-{-{-<

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1169719]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (7)
As of 2024-04-19 09:24 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found