Chaining string ops

by traveler (Parson)
on Aug 28, 2003 at 16:46 UTC ( [id://287445] : perlquestion )

traveler has asked for the wisdom of the Perl Monks concerning the following question:

A recent post to SOPW (which I cannot locate just now) asked how to shorten something like this:
$str = "the boy walked the dog";
$str =~ s/walked/fed/;
$str =~ s/boy/girl/;
$str =~ s/dog/Audrey II/;
by removing all the $str =~ stuff. There were multiple solutions, but I think the question was really along the lines of why we can't access the matched string easily. That is, why can't we chain the substitutions? Wouldn't chaining the substitutions be nice? Is there a string class that allows this kind of thing? It just seems more perlish to be able to chain the substitutions.


Replies are listed 'Best First'.
Re: Chaining string ops
by Aristotle (Chancellor) on Aug 28, 2003 at 17:37 UTC
    for ($str = "the boy walked the dog") {
        s/walked/fed/;
        s/boy/girl/;
        s/dog/Audrey II/;
    }

    Makeshifts last the longest.

Re: Chaining string ops
by seattlejohn (Deacon) on Aug 28, 2003 at 17:20 UTC
    Something like this?

    package StringObject;

    use strict;
    use warnings;
    use overload q{""} => \&to_string;

    sub new {
        my ($class, $string) = @_;
        return bless \$string, $class;
    }

    sub s {
        my ($self, $match, $substitute) = @_;
        $$self =~ s/$match/$substitute/;
        return $self;
    }

    sub to_string {
        my ($self) = @_;
        return $$self;
    }

    1;

    Quick inspection tests:

    use strict;
    use warnings;
    use StringObject;

    my $s1 = StringObject->new("the boy walked the dog");
    print $s1, "\n";
    $s1->s('walked', 'fed')->s('boy', 'girl')->s('dog', 'Audrey II');
    print $s1, "\n";
    exit;

    You'd probably want to overload the other stringy functions (comparisons, concat) as well to make the semantics consistent.

    As written, this won't handle backreferences in the replacement string. There are probably other limitations. I personally don't see a lot of benefits to this substitution chaining idiom in the first place, but I thought it might be fun to show an example of blessing something other than a hashref.

            $perlmonks{seattlejohn} = 'John Clyman';

Re: Chaining string ops
by davido (Cardinal) on Aug 28, 2003 at 17:39 UTC
    I'll present two methods which work fairly well. The first method is clearer to read, but if the list of substitutions is long, it can lead to excessive typing.

    The first method favors long lists of strings to undergo the same substitutions. For example, if $str were actually a large array called @str, you would probably favor the first method.

    The second method favors lots of substitutions to be performed for a single string.

    my $str = "the boy walked the dog";
    for ( $str ) {
        s/walked/fed/;
        s/boy/girl/;
        s/dog/Audrey II/;
    }

    This method relies on the fact that the for statement aliases $_ to $str within the loop body, and that s/// operates on $_ when no variable is bound to it with =~.
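    The same aliasing applies to the @str case mentioned above: each element is aliased to $_ in turn and modified in place. A minimal sketch, assuming a made-up two-element @str (the array contents are illustrative, not from the original post):

    ```perl
    use strict;
    use warnings;

    # Hypothetical sample data; the original post only names @str.
    my @str = (
        "the boy walked the dog",
        "a boy walked home",
    );

    # $_ is an alias for each element, so s/// edits the array in place.
    for (@str) {
        s/walked/fed/;
        s/boy/girl/;
        s/dog/Audrey II/;
    }

    print "$_\n" for @str;
    ```

    Because $_ is an alias rather than a copy, no assignment back into @str is needed.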

    You could also do it this way:

    my $str = "the boy walked the dog";
    my %subs = ( "walked", "fed", "boy", "girl", "dog", "Audrey II" );
    foreach ( keys %subs ) {
        $str =~ s/$_/$subs{$_}/;
    }

    This method is pretty much opposite of my first example. But it can be handy if you have a lot of substitutions.


    "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

      You can get rid of the loop in your second example entirely.
      my %subst_for = (
          "walked" => "fed",
          "boy"    => "girl",
          "dog"    => "Audrey II",
      );
      $_ = "the boy walked the dog";
      my $alt = join '|', map quotemeta, keys %subst_for;
      s/($alt)/$subst_for{$1}/g;

      Makeshifts last the longest.

        Yes, you can do that, which creates a potentially huge alternation list for the regexp engine to swallow. I'm not sure that would be more efficient, though, nor is it as clear.

        Alternation is a convenient tool inside a regexp, and I'm glad to see an example of how to make it work in the context of substituting multiple items with multiple substitute items.

        But alternation is also very inefficient, and can lead to a lot of backtracking even in relatively simple regular expressions. In that context, a loop may be more time efficient. Only benchmarking could say for sure.
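
        A rough way to settle the question is the core Benchmark module. The sketch below is only an assumption about how one might set up the comparison: the sub names with_loop and with_alternation, the repeated test string, and the use of /g in both variants are illustrative choices, not part of the posts above. Results will vary with the Perl version (Perl 5.10 and later optimize alternations of literal strings), the number of keys, and the input size.

        ```perl
        use strict;
        use warnings;
        use Benchmark qw(cmpthese);

        my %subst_for = ( walked => 'fed', boy => 'girl', dog => 'Audrey II' );

        # Hypothetical workload: the sample sentence repeated many times.
        my $text = "the boy walked the dog " x 1000;

        # One s///g per key, applied serially.
        sub with_loop {
            my $str = shift;
            $str =~ s/\Q$_\E/$subst_for{$_}/g for keys %subst_for;
            return $str;
        }

        # A single s///g over an alternation of all the keys.
        my $alt    = join '|', map quotemeta, keys %subst_for;
        my $alt_re = qr/($alt)/;

        sub with_alternation {
            my $str = shift;
            $str =~ s/$alt_re/$subst_for{$1}/g;
            return $str;
        }

        # Run each variant for at least one CPU second and print a comparison table.
        cmpthese( -1, {
            loop        => sub { with_loop($text) },
            alternation => sub { with_alternation($text) },
        } );
        ```

        The two subs return the same string for this data only because no replacement contains another key; with overlapping keys the serial loop and the alternation can disagree.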


        "If I had my life to do over again, I'd be a plumber." -- Albert Einstein

Re: Chaining string ops (multiple searches in linear time)
by grinder (Bishop) on Aug 28, 2003 at 20:10 UTC

    The main problem I have with these solutions is that the searches are carried out serially. And in the earlier thread (heh, I can't find it either, PTAV doesn't work and I've pounded on Super Search too long) on the subject, someone pointed out the dangers of replacement ordering. Mapping woman to girl and man to boy just might leave you with woboy.

    update: found the original thread: multiple (different) substitutions on same $foo, thanks to jcwren fixing a bug in PTAV.
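
    The ordering hazard is easy to reproduce. The sketch below contrasts serial substitution with a single s///g over an alternation of the keys, which rewrites each position at most once. It is a plain regexp alternation, not one of the multi-pattern algorithms; the sample sentence and variable names are made up for illustration.

    ```perl
    use strict;
    use warnings;

    my %subst_for = ( woman => 'girl', man => 'boy' );

    # Serial substitution in the unlucky order: man first, then woman.
    my $serial = "the woman saw the man";
    $serial =~ s/man/boy/g;      # also hits the "man" inside "woman"
    $serial =~ s/woman/girl/g;   # nothing left to match

    # Single pass: the leftmost match wins, and replaced text is never rescanned.
    # Sorting longest-first matters when one key is a prefix of another.
    my $alt = join '|', map quotemeta,
        sort { length($b) <=> length($a) } keys %subst_for;

    ( my $single = "the woman saw the man" ) =~ s/($alt)/$subst_for{$1}/g;

    print "$serial\n";   # the woboy saw the boy
    print "$single\n";   # the girl saw the boy
    ```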

    There are algorithms that perform single-pass (modulo setup) searching for multiple targets in a string. They can provide dramatic gains when you're matching many patterns, and they solve the ordering problem above. I remember Aho-Corasick and Knuth-Morris-Pratt from school, and a search on the Web turned up one I hadn't heard of, namely Commentz-Walter. (Google for these terms, with additional terms, including pattern string multiple search).

    I think Sedgewick covers the first two algorithms (but my copy's at work). Otherwise a reference implementation appears to be available via ftp here:

    From what I recall, these algorithms only search, they don't replace. As you scan the string, you copy over the unmatched runs to the result string, and each time you hit a match you figure out what to replace it with. This part is tricky. If you're searching for a regexp, you can't use what you matched as a key to a hash lookup, to find what you want to replace. That is, for a hypothetical string rice flies like sand and performing the following replacements (bear with me on the mangled syntax):

    /(i[a-z]*e)/      => "ubb$1"
    /([a-z])([ld])/   => "${1}o$2"

    The resulting string would (if I am not mistaken) become rubbice folubbies lubbike sanod.

    You can't use a hash lookup here, because the regexp /(i[a-z]*e)/ matches ice, ie and ike. In other words, meta-characters prevent you from using the result as a key. If you don't use meta-characters, then you can. Otherwise you need to determine the index within the list of sought patterns instead and then apply the replacement.
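
    The "index within the list of patterns" idea can be sketched in plain Perl. This is not Aho-Corasick or any other linear-time algorithm: it simply tries each pattern in turn at the current position (anchored with \G), applies the replacement of the first one that matches, and otherwise copies one character. The sub name replace_in_one_pass and the rule layout (a pattern paired with a code ref that builds the replacement from the captures) are assumptions made for illustration.

    ```perl
    use strict;
    use warnings;

    # Each rule is [ pattern, code ref building the replacement from the captures ].
    sub replace_in_one_pass {
        my ( $text, @rules ) = @_;
        my $out = '';
        my $pos = 0;

        POSITION: while ( $pos < length $text ) {
            for my $rule (@rules) {
                my ( $pattern, $build ) = @$rule;
                pos($text) = $pos;
                next unless $text =~ /\G$pattern/gc;   # must match right here
                next if $+[0] == $pos;                 # ignore empty matches

                # Collect the capture groups of this match via @- and @+.
                my @captures = map {
                    defined $-[$_] ? substr( $text, $-[$_], $+[$_] - $-[$_] ) : undef
                } 1 .. $#+;

                $out .= $build->(@captures);
                $pos = $+[0];                          # continue after the match
                next POSITION;
            }
            $out .= substr $text, $pos, 1;             # no rule matched: copy one char
            $pos++;
        }
        return $out;
    }

    my @rules = (
        [ qr/(i[a-z]*e)/,    sub { "ubb$_[0]" } ],
        [ qr/([a-z])([ld])/, sub { "$_[0]o$_[1]" } ],
    );

    print replace_in_one_pass( "rice flies like sand", @rules ), "\n";
    # prints: rubbice folubbies lubbike sanod
    ```

    Rule order decides ties when two patterns match at the same position, and replaced text is never rescanned, so the woboy problem cannot occur.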

    I investigated this approach some years ago. I needed to replace several hundred patterns in 5-10 megabyte files. At the time no modules existed on CPAN to do this. I lost interest in the approach due to time constraints (I just suffered the pain of figuring out the ordering and applying the replacements serially), but if such a module does exist (and has an elegant, intuitive interface -- this is definitely hard), I'd love to hear about it.

Re: Chaining string ops
by asarih (Hermit) on Aug 28, 2003 at 18:14 UTC
    If I understand you correctly, you want to say
    $str = "the boy walked the dog";
    $str =~ s/walked/fed/ =~ s/boy/girl/ =~ s/dog/Audrey II/; # error
    Here's why it shouldn't work. Since s/// returns the number of substitutions made, whichever way you associate =~, you end up with something like
    $str =~ 1
    at some point. This is not legal. If you want to chain =~, then we must expect s/// to return a string.

    Specifically, we must get one that does regular expressions such as s/this/that/ or m[^/usr/local/bin], but then this prohibits us from saying

    if ($str =~ s/boy/girl/) {
        # do this
    }
    because the condition would no longer tell us whether a substitution was made. I'd rather keep if ( $s =~ /match/ ) { .. } than gain the chain-ability of s///.

    Update: The above discussion assumes that =~ is right associative, but in fact it is left associative. This means that as long as =~ returns an integer, we have trouble (say 1 =~ s/this/that/). I like roju's idea of returning a string, but I don't know how easy/hard it is to implement it.

      Could we not have it both ways? After all, this is perl....

      if ($s =~ s/change/me/)
      $str = "some string" =~ s/a/b/r =~ s/foo/bar/r

      where /r is the newly created "chaining" modifier that forces a /r/eturn of the new string. Make it have no side effects either (ie. not touch the original string), and it'd fill a need.

      Update: changed /c to /r, seeing as /c is taken.

      Yes. I understand this. The question is which is more useful: returning the number of matches, or returning the changed string? Both are useful! I like roju's suggestion of a new string modifier.
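
      For readers on later Perls: a /r modifier with these semantics (return the modified copy, leave the original alone) was added in Perl 5.14. Since =~ is left associative, the chain proposed above works there as written. A minimal sketch, assuming Perl 5.14 or newer:

      ```perl
      use strict;
      use warnings;
      use 5.014;    # s///r needs Perl 5.14 or later

      my $str = "the boy walked the dog";

      # Each s///r returns a new string, which becomes the target of the next one.
      my $new = $str =~ s/walked/fed/r =~ s/boy/girl/r =~ s/dog/Audrey II/r;

      print "$new\n";   # the girl fed the Audrey II
      print "$str\n";   # the boy walked the dog (original untouched)
      ```

      Without /r, s/// still returns the number of substitutions, so if ($s =~ s/change/me/) keeps working as before.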


Re: Chaining string ops
by fletcher_the_dog (Friar) on Aug 28, 2003 at 18:04 UTC
    Here is a way you could encapsulate the chaining in a subroutine
    use strict;

    my $str = "the boy walked the dog";
    $str = ChainSubs($str,
        walked => 'fed',
        boy    => 'girl',
        dog    => 'Audrey II'
    );
    print $str;

    sub ChainSubs {
        my ($string, %subs) = @_;
        foreach my $original (keys %subs) {
            $string =~ s/$original/$subs{$original}/;
        }
        $string;
    }
    This of course assumes that you don't care what order the substitutions are done in. If you did care, you could walk through @_ two elements at a time like this:
    sub ChainSubs {
        my $string = shift;
        for (my $i = 0; $i < @_; $i += 2) {
            $string =~ s/$_[$i]/$_[$i+1]/g;
        }
        return $string;
    }
      Using splice is easier than doing it with for(;;).
      sub ChainSubs {
          local $_ = shift;
          while (@_) {
              my ($s, $t) = splice @_, 0, 2;
              s/$s/$t/g;
          }
          return $_;
      }

      Makeshifts last the longest.