http://www.perlmonks.org?node_id=1230284

haukex has asked for the wisdom of the Perl Monks concerning the following question:

In this StackOverflow question, user "Learner" asked how to replace a set of certain letters in the first column of a file with numbers. In other words, given this set of replacements:

A=>1, B=>5, C=>6, D=>4, E=>7, F=>16, G=>10, H=>11, I=>12, K=>14, L=>13, M=>15, N=>3, P=>17, Q=>8, R=>2, S=>18, T=>19, V=>22, W=>20, Y=>21, Z=>9

And this input:

NDDDDTSVCLGTRQCSWFAGCTNRTWNSSA 0
VCLGTRQCSWFAGCTNRTWNSSAVPLIGLP 0
LTWSGNDTCLYSCQNQTKGLLYQLFRNLFC 0
CQNQTKGLLYQLFRNLFCSYGLTEAHGKWR 0
ITNDKGHDGHRTPTWWLTGSNLTLSVNNSG 0
GHRTPTWWLTGSNLTLSVNNSGLFFLCGNG 0
FLCGNGVYKGFPPKWSGRCGLGYLVPSLTR 0
KGFPPKWSGRCGLGYLVPSLTRYLTLNASQ 0
QSVCMECQGHGERISPKDRCKSCNGRKIVR 1

The expected output is:

3 4 4 4 4 19 18 22 6 13 10 19 2 8 6 18 20 16 1 10 6 19 3 2 19 20 3 18 18 1
22 6 13 10 19 2 8 6 18 20 16 1 10 6 19 3 2 19 20 3 18 18 1 22 17 13 12 10 13 17
13 19 20 18 10 3 4 19 6 13 21 18 6 8 3 8 19 14 10 13 13 21 8 13 16 2 3 13 16 6
6 8 3 8 19 14 10 13 13 21 8 13 16 2 3 13 16 6 18 21 10 13 19 7 1 11 10 14 20 2
12 19 3 4 14 10 11 4 10 11 2 19 17 19 20 20 13 19 10 18 3 13 19 13 18 22 3 3 18 10
10 11 2 19 17 19 20 20 13 19 10 18 3 13 19 13 18 22 3 3 18 10 13 16 16 13 6 10 3 10
16 13 6 10 3 10 22 21 14 10 16 17 17 14 20 18 10 2 6 10 13 10 21 13 22 17 18 13 19 2
14 10 16 17 17 14 20 18 10 2 6 10 13 10 21 13 22 17 18 13 19 2 21 13 19 13 3 1 18 8
8 18 22 6 15 7 6 8 10 11 10 7 2 12 18 17 14 4 2 6 14 18 6 3 10 2 14 12 22 2

Here are my two solutions:

$ perl '-M5;%h=map{$_,++$i}split//,"ARNDBCEQZGHILKMFPSTWYV"' -alpe '($_=$F[0])=~s/[A-Z]/$h{$&} /g'

$ perl -alpe '($_=$F[0])=~s/[A-Z]/(index("ARNDBCEQZGHILKMFPSTWYV",$&)+1)." "/ge'
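For anyone who finds the one-liners dense, here is a spelled-out sketch of the same idea as a small script. The helper name encode_line is made up for illustration. Unlike the one-liners, it joins the numbers with single spaces (so no trailing space) and passes any character that is not in the table through unchanged; the latter is an assumption, since the OP did not say what should happen in that case.

```perl
use strict;
use warnings;

# Letter order taken from the one-liners: the 1-based position of each
# letter in this string is the number it maps to (A=>1, R=>2, N=>3, ...).
my $order = "ARNDBCEQZGHILKMFPSTWYV";
my %num_for;
@num_for{ split //, $order } = 1 .. length $order;

# encode_line is a hypothetical helper name, not part of the original post.
sub encode_line {
    my ($line) = @_;
    my ($first) = split ' ', $line;    # first whitespace-separated column
    return '' unless defined $first;
    # Assumption: characters missing from the table are left as they are.
    return join ' ', map { $num_for{$_} // $_ } split //, $first;
}

# First record of the sample input from the question.
print encode_line("NDDDDTSVCLGTRQCSWFAGCTNRTWNSSA 0"), "\n";
# prints: 3 4 4 4 4 19 18 22 6 13 10 19 2 8 6 18 20 16 1 10 6 19 3 2 19 20 3 18 18 1
```

To process a whole file, the sample call could be replaced by a loop such as `print encode_line($_), "\n" while <>;`.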


I personally consider the extra space character at the end of each line produced by my solutions acceptable (I think comparing with diff -b is probably ok too). Unfortunately, the OP didn't specify what should happen if the input strings contain letters that aren't in the set, so I guess "bonus points" for solutions that only affect [A-IK-NP-TV-WY-Z] instead of [A-Z] like my solutions do.

Bonus question: Can anyone come up with a short, preferably pure Perl, solution to produce such a regex character class for any given list of letters? My own attempt uses Set::IntSpan:

$ echo "ARNDBCEQZGHILKMFPSTWYV" | perl -MSet::IntSpan -ple ' $_=Set::IntSpan->new([map{ord}split//,$_])->run_list; s/\d+/chr$&/eg;s/,//g;$_="[$_]"'
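As one possible pure-Perl take on the bonus question, the sketch below sorts the character codes, merges consecutive codes into runs, and prints each run as "first-last". The name char_class is invented for this example, and the code assumes the input consists only of plain ASCII letters or digits, so nothing inside the brackets needs escaping.

```perl
use strict;
use warnings;

# char_class is a hypothetical helper name. Assumption: the input holds only
# ASCII letters or digits, so no character needs escaping inside [...].
sub char_class {
    my ($letters) = @_;
    my %seen;
    my @codes = sort { $a <=> $b }
                grep { !$seen{$_}++ }
                map  { ord($_) } split //, $letters;

    my @runs;    # each run is [first_code, last_code]
    for my $c (@codes) {
        if (@runs && $c == $runs[-1][1] + 1) {
            $runs[-1][1] = $c;           # extend the current run
        }
        else {
            push @runs, [ $c, $c ];      # start a new run
        }
    }

    my $body = join '', map {
        my ($lo, $hi) = @$_;
        $lo == $hi ? chr($lo) : chr($lo) . '-' . chr($hi);
    } @runs;
    return "[$body]";
}

print char_class("ARNDBCEQZGHILKMFPSTWYV"), "\n";
# prints: [A-IK-NP-TV-WY-Z]
```

The returned string can be interpolated into a pattern, for example `my $re = qr/${\ char_class($letters) }/;`, so that a substitution only touches letters that are actually in the set.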

Have at it ;-)

Update: Thank you for the inspired and inspiring responses so far, Discipulus, Eily, rsFalse, Veltro, and vr! I really enjoy the creativity in the solutions :-)