The Unicode Bug with Transliteration or Substitution
by choroba (Canon)
on May 02, 2014 at 21:50 UTC
choroba has asked for the
wisdom of the Perl Monks concerning the following question:
Hi brethren and sestren.
On several machines at work, we run Perl 5.8.3 (yes, I know it's 10 years old; not my choice). We recently noticed some strange behaviour: we used tr to process some HTML files. If the files contained non-Latin characters (e.g. Chinese), the output was garbled on some machines. When we replaced tr with a substitution, the output was suddenly correct.
Both input and output are marked with :encoding(utf-8). The files must be slurped to trigger the bug; line-by-line processing produces the correct output.
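For anyone who wants to try this on an affected machine, here is a minimal sketch of the comparison. The actual transliteration is not reproduced here, so tr/<>/[]/ and the matching substitution are hypothetical stand-ins, as are the sample text and the helper names (with_tr, with_subst, slurp, by_line). It only checks whether the four combinations of slurp or line-by-line reading with tr or s/// agree; it makes no claim about the cause.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample input: a bit of HTML containing Chinese characters.
my $sample = "<p>\x{4E2D}\x{6587} text</p>\n<p>more \x{6F22}\x{5B57}</p>\n";

# Write the sample to a temporary file through the utf-8 encoding layer.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(utf-8)';
print {$fh} $sample;
close $fh or die "close: $!";

# Hypothetical stand-in for the real transliteration.
sub with_tr {
    my ($s) = @_;
    $s =~ tr/<>/[]/;
    return $s;
}

# The same mapping expressed as a substitution.
sub with_subst {
    my ($s) = @_;
    my %map = ('<' => '[', '>' => ']');
    $s =~ s/([<>])/$map{$1}/g;
    return $s;
}

# Read the whole file into one string.
sub slurp {
    my ($file) = @_;
    open my $in, '<:encoding(utf-8)', $file or die "open: $!";
    local $/;    # undefined record separator: read everything at once
    my $content = <$in>;
    close $in;
    return $content;
}

# Read the file line by line, filtering each line separately.
sub by_line {
    my ($file, $filter) = @_;
    open my $in, '<:encoding(utf-8)', $file or die "open: $!";
    my $out = '';
    while (my $line = <$in>) {
        $out .= $filter->($line);
    }
    close $in;
    return $out;
}

my %result = (
    slurp_tr    => with_tr(slurp($path)),
    slurp_subst => with_subst(slurp($path)),
    line_tr     => by_line($path, \&with_tr),
    line_subst  => by_line($path, \&with_subst),
);

# Compare every variant against the line-by-line substitution result.
for my $name (sort keys %result) {
    my $status = $result{$name} eq $result{line_subst} ? 'same' : 'DIFFERENT';
    print "$name: $status\n";
}
```

On a Perl without the problem, all four variants should agree; a DIFFERENT line for slurp_tr alone would match the symptom described above.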
Could this be one of the manifestations of The "Unicode Bug"? I have a gut feeling that the substitution might solve the problem for the given file, but that the bug could reappear with the next file. I also don't understand why the bug only appeared on some machines: the version of Perl is the same on all of them, but their Linux versions differ. Is any external library involved in transliteration, substitution, or Unicode handling?
BTW: I wasn't able to install 5.8.3 at home (errors during make) to test further. Update: I was able to install it with the help of Devel::PatchPerl. I wasn't able to replicate the problem, though.
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ