in reply to Regex to encode entities in XML

Generating valid XML for the CB might actually be harder than it looks, as I am not sure how easy it is to figure out the encoding of the messages.

The problem you have might be a bug in XML::Parser: if I use the regexp and then HTML::Entities, I get the proper result with XML::Parser 2.27 but the wrong one with XML::Parser 2.30 (it looks like characters lose their UTF-8'edness with the latter).

The solution is either to use Text::Iconv or the Unicode modules, as described in my first post about encodings, or to go module lifting once again and grab code from XML::DOM:

    sub safe_encode
    {
        my $str = shift;
        # replace every multi-byte UTF-8 sequence with a numeric character reference
        $str =~ s{([\xC0-\xDF].|[\xE0-\xEF]..|[\xF0-\xFF]...)}
                 {XmlUtf8Decode ($1)}egs;
        return $str;
    }

    # turn the bytes of one UTF-8 encoded character into &#nnn; (or &#xhhh; if $hex is true)
    sub XmlUtf8Decode
    {
        my ($str, $hex) = @_;
        my $len = length ($str);
        my $n;

        if ($len == 2)
        {
            my @n = unpack "C2", $str;
            $n = (($n[0] & 0x3f) << 6) + ($n[1] & 0x3f);
        }
        elsif ($len == 3)
        {
            my @n = unpack "C3", $str;
            $n = (($n[0] & 0x1f) << 12) + (($n[1] & 0x3f) << 6) +
                 ($n[2] & 0x3f);
        }
        elsif ($len == 4)
        {
            my @n = unpack "C4", $str;
            $n = (($n[0] & 0x0f) << 18) + (($n[1] & 0x3f) << 12) +
                 (($n[2] & 0x3f) << 6) + ($n[3] & 0x3f);
        }
        elsif ($len == 1)    # just to be complete...
        {
            $n = ord ($str);
        }
        else
        {
            die "bad value [$str] for XmlUtf8Decode";
        }
        $hex ? sprintf ("&#x%x;", $n) : "&#$n;";
    }

This will encode all non-ASCII characters as &#nnn;, where nnn is the decimal Unicode code point of the character. Note that it expects the input to be UTF-8 bytes. The result seems to display properly, at least in Opera on Linux.
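
To make the bit arithmetic concrete, here is a small stand-alone sketch that works through two characters by hand. The helper name utf8_bytes_to_codepoint is made up for this illustration (it is not part of XML::DOM), it only handles 2- and 3-byte sequences, and it uses the strict lead-byte masks (0x1f and 0x0f), which give the same result as the looser masks above for valid UTF-8.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: decode the bytes of ONE
# UTF-8 encoded character (2 or 3 bytes long) into its Unicode code point.
sub utf8_bytes_to_codepoint {
    my @b = unpack "C*", shift;

    # 110xxxxx 10xxxxxx: 5 payload bits from the lead byte, 6 from the next
    return (($b[0] & 0x1f) << 6) + ($b[1] & 0x3f)
        if @b == 2;

    # 1110xxxx 10xxxxxx 10xxxxxx: 4 + 6 + 6 payload bits
    return (($b[0] & 0x0f) << 12) + (($b[1] & 0x3f) << 6) + ($b[2] & 0x3f)
        if @b == 3;

    die "unsupported sequence length " . scalar(@b);
}

my $e_acute = "\xC3\xA9";        # UTF-8 bytes of e-acute, U+00E9
my $euro    = "\xE2\x82\xAC";    # UTF-8 bytes of the euro sign, U+20AC

printf "&#%d;\n", utf8_bytes_to_codepoint($e_acute);    # prints &#233;
printf "&#%d;\n", utf8_bytes_to_codepoint($euro);       # prints &#8364;
```

So a message containing the two bytes C3 A9 should come out of safe_encode with &#233; in their place, which any XML parser can read regardless of the declared encoding.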

Let me know if this solves your problem.