The source of the data is a large number of RSS feeds used which point to an even larger number of individual web pages. The latter are what are harvested and processed with a few scripts. So normalizing the data at the source is not an option, since so few webmasters even publish mail addresses let alone fix their sites.
Maybe there is a CPAN module or simple method to forcibly convert the incoming data (or outgoing data) to UTF? Just calling it UTF-8 fails, too: binmode(STDOUT, ":encoding(utf8)"); Is there a way to find out if it should be labeled UTF-16 instead? If so then how to force that mode?
$ apt-cache policy perl | head -n 3
perl:
Installed: 5.28.1-6
Candidate: 5.28.1-6