http://www.perlmonks.org?node_id=952321

slugger415 has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, this might be a dumb question, but I'm still learning.

I'm doing a series of string substitutions in a file. I have a hash, something like this:

my %subs = (
    "Xaa1"  => "sub1",
    "Xaa11" => "sub2",
);
foreach my $s (keys %subs) {
    $data =~ s/$s/$subs{$s}/g;
}

I'm sure you can see the problem here. If the string is "Xaa11" it might first substitute "Xaa1" and never see the final "1", so the result ends up being "sub11" instead of "sub2".

Am I going about this the right way, and how do I make sure it checks to the end of "Xaa11"?

Thank you!

Re: regex/substitution question
by ikegami (Patriarch) on Feb 07, 2012 at 18:41 UTC
    Using that approach, you'll also have problems with
    my %subs = ( foo => 'bar', bar => 'baz', );

    "foo" may end up being substituted with "baz".

    You want to search for all the patterns in the same pass.

    s/(foo|bar)/$subs{$1}/g
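
To see why the single pass matters, here is a minimal sketch that compares the two approaches on the foo/bar table. The sub names chained_replace and single_pass_replace are made up for illustration. The chained version applies the keys in a fixed order (foo, then bar) so that the failure is reproducible; keys %subs itself returns them in no guaranteed order.

```perl
use strict;
use warnings;

my %subs = ( foo => 'bar', bar => 'baz' );

# Hypothetical helper: one s///g per key, applied in a fixed order.
# The text produced by the first substitution is rescanned by the second.
sub chained_replace {
    my ($text) = @_;
    for my $key ('foo', 'bar') {
        $text =~ s/\Q$key\E/$subs{$key}/g;
    }
    return $text;
}

# Hypothetical helper: one alternation, one pass.
# Text that has already been replaced is never rescanned.
sub single_pass_replace {
    my ($text) = @_;
    $text =~ s/(foo|bar)/$subs{$1}/g;
    return $text;
}

print chained_replace('foo bar'), "\n";       # baz baz
print single_pass_replace('foo bar'), "\n";   # bar baz
```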

    And you want to search for the longest matches first.

    s/(Xaa11|Xaa1)/$subs{$1}/g

    So what you want is

    my $pat = join '|', map quotemeta, sort { length($b) <=> length($a) } keys(%subs);
    s/($pat)/$subs{$1}/g;
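
As a complete script, the longest-first approach might look like the sketch below, using the table from the original question. The replace_all wrapper is a hypothetical name added only so the substitution can be called on a sample string.

```perl
use strict;
use warnings;

my %subs = ( Xaa1 => 'sub1', Xaa11 => 'sub2' );

# Longest keys first, so "Xaa11" is tried before its prefix "Xaa1".
# quotemeta escapes any regex metacharacters in the keys.
my $pat = join '|',
          map  { quotemeta($_) }
          sort { length($b) <=> length($a) }
          keys %subs;

# Hypothetical wrapper around the substitution, for illustration only.
sub replace_all {
    my ($data) = @_;
    $data =~ s/($pat)/$subs{$1}/g;
    return $data;
}

print replace_all('Xaa1 and Xaa11'), "\n";   # sub1 and sub2
```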

    Alternatively, if you're always matching entire words, you could forgo the sorting in favour of using anchors.

    my $pat = join '|', map quotemeta, keys(%subs);
    s/\b($pat)\b/$subs{$1}/g;
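
A small sketch of the anchored variant, assuming every key starts and ends with a word character. The sub name replace_words is hypothetical. Whatever order the keys come out in, the trailing \b rejects "Xaa1" when another "1" follows, so the regex engine falls back to "Xaa11".

```perl
use strict;
use warnings;

my %subs = ( Xaa1 => 'sub1', Xaa11 => 'sub2' );

# No sorting needed: the \b anchors only accept whole-word matches.
my $pat = join '|', map { quotemeta($_) } keys %subs;

# Hypothetical wrapper around the substitution, for illustration only.
sub replace_words {
    my ($data) = @_;
    $data =~ s/\b($pat)\b/$subs{$1}/g;
    return $data;
}

print replace_words('Xaa1 Xaa11 Xaa111'), "\n";   # sub1 sub2 Xaa111
```

Note that "Xaa111" is left alone, because neither key matches it as a whole word.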

      Excellent! Thank you. I've not used "map" before so will have to study that one.

      Regarding the use of \b: if my hash key has a period at the end, what would happen? This seems to work fine, but am I causing a potential problem? I guess my question is, since . also marks a word boundary, might the period be left behind after the substitution?

      %subs = ("Xaa1." => "sub1");
      foreach $s (keys %subs) {
          $data =~ s/\b$s\b[.]?/$subs{$s}/g;
      }

      Anyway thanks much, this is very helpful.

      UPDATE:

      Ignore that last question. What I meant was: what if $data contains a period after Xaa1, rather than the key in %subs? For example: Here is my Xaa1. data string.

      But I think I'm asking a confusing question.. so please ignore. :-)

        I'm wondering if my hash name contains a period at the end, what would happen?

        Depends on what you expect to follow the period, but I'm betting it wouldn't be appropriate to use \b.
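
To make that concrete: \b only matches between a word character and a non-word character, so after a key ending in a period it succeeds only if a word character comes next. The sketch below shows this, and also a lookaround-based variant that is an assumption of mine rather than something proposed in this thread. The sub names are hypothetical.

```perl
use strict;
use warnings;

my %subs = ( 'Xaa1.' => 'sub1' );
my $pat  = join '|', map { quotemeta($_) } keys %subs;

# \b after the key: requires a word character right after the period.
sub replace_with_b {
    my ($data) = @_;
    $data =~ s/\b($pat)\b/$subs{$1}/g;
    return $data;
}

# Assumed alternative (not from the thread): only require that the key
# is not glued to word characters on either side.
sub replace_with_lookaround {
    my ($data) = @_;
    $data =~ s/(?<!\w)($pat)(?!\w)/$subs{$1}/g;
    return $data;
}

print replace_with_b('Here is my Xaa1. data string.'), "\n";
# Here is my Xaa1. data string.   (no match: period is followed by a space)

print replace_with_b('Xaa1.data'), "\n";
# sub1data

print replace_with_lookaround('Here is my Xaa1. data string.'), "\n";
# Here is my sub1 data string.
```

Which behaviour is right depends, as noted above, on what is expected to follow the period.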

Re: regex/substitution question
by AnomalousMonk (Archbishop) on Feb 07, 2012 at 23:47 UTC

    Another approach is to use something like Regexp::Assemble to create the search regex. This might be advantageous if you have a large substitution table, the table is in a file, etc. This module assumes that regexes are being added by the add() method, so the map quotemeta step is still highly recommended if you are searching for pure strings that might contain anything like a regex metacharacter. Sorting by length is implicit.

    >perl -wMstrict -le
    "use Regexp::Assemble;
     ;;
     my %subs = (Xaa1 => 'foo', Xaa11 => 'bar', Xaa => 'baz',);
     ;;
     my $ra = Regexp::Assemble
         ->new
         ->add(map quotemeta, keys %subs)
         ->anchor_word
         ;
     print $ra->re;
     ;;
     my $s = 'Xaa Xaa1 Xaa11 Xa';
     $s =~ s{ ($ra) }{$subs{$1}}xmsg;
     print qq{'$s'};
    "
    (?-xism:\bXaa(?:1?1)?\b)
    'baz foo bar Xa'
      Interesting, thank you! -- Scott