http://www.perlmonks.org?node_id=1025987


in reply to How to remove the rubbish from "United Kingdom"

I would suggest that you examine the XML file with a hex editor so that you can see for yourself exactly which bytes it contains. There might be a multi-byte Unicode sequence (UTF-8, for instance). Or, in some cases, there might be a character entity reference such as &amp; or &#160; in the actual data. A hex editor shows you the raw bytes before any program has had a chance to “decode” them. You will see it plain.
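If no hex editor is handy, a few lines of script will do the same job. The following is only a sketch in Python: `hex_dump` is a name I made up, and the sample string (with a trailing non-breaking space, U+00A0) is a guess at the kind of “rubbish” involved, not the actual data from the original question.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Return a classic offset / hex / printable-ASCII dump of raw bytes."""
    pad = width * 3
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Anything outside printable ASCII is shown as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}} {text_part}")
    return "\n".join(lines)


# Hypothetical sample: a non-breaking space hiding after the country name.
sample = "<country>United Kingdom\u00a0</country>".encode("utf-8")
print(hex_dump(sample))
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that nothing gets decoded on the way in. In the dump, the pair `c2 a0` is the UTF-8 encoding of the non-breaking space.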

The next step would be to do the same thing with the database record. Most SQL databases have a function that will display the content of a text or BLOB field in hexadecimal (HEX() in MySQL, for example). Once again: what is, byte for byte, in that database record?
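As an illustration of that database-side check, here is a small sketch using SQLite's built-in `hex()` function through Python's `sqlite3` module. The table and column names are invented, and your database will have its own equivalent function; the point is only that the server hands back the stored bytes rather than an interpreted string.

```python
import sqlite3

# In-memory database with a made-up table, purely for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE places (name TEXT)")
conn.execute(
    "INSERT INTO places (name) VALUES (?)",
    ("United Kingdom\u00a0",),  # hypothetical value with a trailing U+00A0
)

# hex() returns the stored bytes as an uppercase hexadecimal string.
stored_hex = conn.execute("SELECT hex(name) FROM places").fetchone()[0]
print(stored_hex)
```

Compare this output against the hex dump of the XML file: if the bytes already differ at this point, the damage happened on the way into the database, not on the way out.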

Now comes the fun part ... the SQL metadata for a column can include a character-set specification, which tells interested programs how the bytes should be interpreted (at least, those programs that actually pay attention to the metadata). Your terminal or shell program is in the same position ... it, too, has to make an educated guess about how to interpret those bytes. Be careful that it is not misleading you into perceiving “a problem” that the real consumer of the data won’t perceive.
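To see how the same bytes can look fine or look like rubbish depending on who interprets them, consider this sketch. The sample value is again hypothetical; Latin-1 merely stands in for whatever character set a terminal or client might wrongly assume.

```python
# The bytes as they might sit in the file or the database (hypothetical sample).
raw = "United Kingdom\u00a0".encode("utf-8")

as_utf8 = raw.decode("utf-8")      # interpreted the way the bytes were written
as_latin1 = raw.decode("latin-1")  # interpreted under a wrong assumption

print(repr(as_utf8))
print(repr(as_latin1))
```

Under the wrong assumption, the two-byte sequence shows up as a stray “Â” followed by a non-breaking space. The stored bytes are identical in both cases; only the interpretation changed.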

It is a good idea for stored XML data to explicitly include an encoding declaration. Even if it is not stored in the record (relying instead on the column metadata), it’s a good idea to emit the <?xml version="1.0" encoding="..."?> declaration on any subsequent output or use of the stored data. XML is meant to be self-describing and needs that description.
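For what it's worth, most XML libraries can add that declaration for you when serializing. Below is a sketch using Python's standard `xml.etree.ElementTree` (the `xml_declaration` argument needs Python 3.8 or later); the element name is invented for the example.

```python
import xml.etree.ElementTree as ET

# Made-up element, standing in for whatever is stored in the record.
root = ET.Element("country")
root.text = "United Kingdom"

# Serialize to bytes with an explicit XML declaration naming the encoding.
xml_bytes = ET.tostring(root, encoding="utf-8", xml_declaration=True)
print(xml_bytes.decode("utf-8"))

# A consumer can parse the bytes without having to guess the encoding.
parsed = ET.fromstring(xml_bytes)
```

The declared encoding has to match how the bytes were really written; a declaration that disagrees with the data is worse than none.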

P.S. Removing the rubbish from the United Kingdom appears to be a political problem... ;-)
It appears that someone took offense at this American attempt at humor.