Perl-Sensitive Sunglasses PerlMonks

### Re: Case-preserving substitutions

by I0 (Priest)
 on Jan 18, 2002 at 13:21 UTC

s/\b($old)\b/lc$new^$1^lc$1/gie

Update:
In case $new=~/\W/ or length$new<length$old, it has to be slightly more complicated:
s/\b($old)\b/lc$new^($1^lc$1)&(lc$new^uc$new)/gie
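
For anyone who wants to try the second form, here is a minimal sketch that wraps it in a sub. The names recase and replace_preserving_case and the sample strings are made up for illustration and are not from the post. It assumes plain ASCII text and a Perl where the bitwise feature is not enabled (with that feature on, string XOR is spelled ^. instead of ^).

```perl
use strict;
use warnings;

# Hypothetical helper: give $new the letter case of $match, position by position.
sub recase {
    my ($match, $new) = @_;
    my $case_mask = $match ^ lc $match;     # "\x20" wherever $match was uppercase
    my $letters   = lc($new) ^ uc($new);    # "\x20" wherever $new holds an ASCII letter
    # & truncates to the shorter string, so the mask never runs past $new
    return lc($new) ^ ($case_mask & $letters);
}

# Hypothetical wrapper around the substitution; $old is used as a regex,
# exactly as in the one-liner.
sub replace_preserving_case {
    my ($text, $old, $new) = @_;
    $text =~ s/\b($old)\b/recase($1, $new)/gie;
    return $text;
}

print replace_preserving_case("Perl PERL perl", "perl", "ruby"), "\n";  # Ruby RUBY ruby
print replace_preserving_case("Hello there", "hello", "hi"), "\n";      # Hi there
```

The second call is the length$new<length$old case: without the & mask, the leftover bytes of $1^lc$1 would end up appended to the replacement.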

Re: Re: Case-preserving substitutions
by jryan (Vicar) on Jan 18, 2002 at 13:31 UTC

What did you do, golf mine? :)

I was unaware of that property of ^, very nice... in fact, I'm not even sure why that works. Time for me to hit perlman:perlop.

Update:

OK, it's like this: ^ is XOR. That means each bit of the result is set if and only if exactly one of the 2 input bits is set:

```
lc $new  # lowercase the replacement
^ $1     # XOR with $1 - toggles every bit that is set in
         # the corresponding character of $1
^ lc $1  # XOR with lc $1 - toggles those bits back, except
         # where $1 and lc $1 differ - if a character was
         # lowercase to begin with, the net change is 0,
         # otherwise the one remaining bit (the case bit)
         # is flipped, which uppercases the corresponding
         # character
```
In short - an incredibly concise way of doing it. ++!

Re: Re: Case-preserving substitutions
by petral (Curate) on Jan 18, 2002 at 23:54 UTC
Another try at explaining this:

When dealing with 7-bit ASCII, the uppercase letters begin at 65 and the lowercase letters at 97 -- 32 higher. Since 32 is a power of two, represented by bit 5 of the character, the letter is lc if this bit is set and Uc if it is unset.
```
$ perl -lwe'$,=$\;print unpack("B*","A"), unpack("B*","a"), unpack"B*","A"^"a"'
01000001  <- "A": 64 + 1
01100001  <- "a": 64 + 32 + 1
00100000  <- result of XORing
```
In $1 ^ lc $1, that bit will be set only if the original was uppercase. Since XORing something with itself is always 0, that is the only bit which can be set. The lc of the replacement will have that bit set because that's what makes it lc, with other bits set to determine which letter.

So, bit 5 is set in the XORing of the original with its lc self only if the original is Uc (the opposite of the bit's usual meaning!), and it is always set in the lc replacement. If both are set, XOR clears the bit in the result: hence Uc. If only the replacement has it set, XOR leaves it set: lc.
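
The single-character case can be sketched like this. copy_case is a hypothetical helper name (not from the thread), and the sketch assumes ASCII letters only:

```perl
use strict;
use warnings;

# Hypothetical helper: give the single letter $new the case of $orig.
sub copy_case {
    my ($orig, $new) = @_;
    my $mask = $orig ^ lc $orig;   # chr(32) if $orig is uppercase, chr(0) if lowercase
    return lc($new) ^ $mask;       # bit 5 is cleared only when the mask has it set
}

for my $pair (['A', 'x'], ['a', 'x'], ['Q', 'Y'], ['q', 'Y']) {
    my ($orig, $new) = @$pair;
    printf "%s + %s -> %s (mask %08b)\n",
        $orig, $new, copy_case($orig, $new), ord($orig ^ lc $orig);
}
```

The mask prints as 00100000 for the uppercase originals and 00000000 for the lowercase ones, so only the first and third lines come out uppercased.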

I think at this point I should exclaim "QED" and run.   It seemed clear enough before I started trying to explain it in this little box!

update: But note that jryan's answer above will work with any locale!

reupdate: I0 points out (and I should've checked) that capitalizing-by-resetting-bit-5 also works for the 8-bit characters in the standard ISO-8859-1 ("latin-1") character set.

p

