http://www.perlmonks.org?node_id=1025918

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hiya Monks, in the XML it shows a proper "United Kingdom", but after loading the XML into the site database it shows "United Kingdom". What could be the problem? I am not a Perl man, but the script is in Perl! How can I remove the rubbish by adding code to the script? Thanks a lot.

Re: How to remove the rubbish from "United Kingdom"
by Anonymous Monk on Mar 28, 2013 at 10:11 UTC
Re: How to remove the rubbish from "United Kingdom"
by Tux (Canon) on Mar 28, 2013 at 10:31 UTC

    A non-breaking space? How is the final data en/de-coded? What is the source-encoding? Is the encoding declared in the XML?
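
To illustrate why a non-breaking space is a plausible suspect, here is a minimal sketch using only the core Encode module. It assumes the XML is UTF-8 and that some later step reads the bytes as Latin-1; neither is confirmed in the thread, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# "United", a non-breaking space (U+00A0), then "Kingdom".
my $text = "United\x{00A0}Kingdom";

# Encoded as UTF-8, the non-breaking space becomes the two bytes C2 A0.
my $utf8_bytes = encode('UTF-8', $text);

# ASSUMPTION: a later step misreads those bytes as Latin-1. Byte C2 then
# becomes a stray U+00C2 character in front of the non-breaking space.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Decoding with the encoding the bytes were written in restores the text.
my $correct = decode('UTF-8', $utf8_bytes);

printf "bytes:   %s\n", join ' ', map { sprintf '%02X', ord($_) } split //, $utf8_bytes;
printf "misread: %d characters\n", length($misread);
printf "correct: %d characters\n", length($correct);
```

If this is what happens, the cure is to decode and encode consistently at every step rather than to strip characters after the fact.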


    Enjoy, Have FUN! H.Merijn
Re: How to remove the rubbish from "United Kingdom"
by space_monk (Chaplain) on Mar 28, 2013 at 11:18 UTC
    The encoding is normally specified at the start of the XML document:
    <?xml version="1.0" encoding="windows-1252"?> or <?xml version="1.0" encoding="UTF-8"?>
    It is also possible that the field in your database table does not store characters in the same encoding as the XML document. Some databases have a default character-set setting.
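
As a rough sketch of acting on the declared encoding, the snippet below pulls the name out of the XML declaration and decodes the raw bytes with it. The helper `declared_encoding` and the sample document are hypothetical, and the pattern match only works for ASCII-compatible encodings; a real XML parser handles this step itself.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: return the encoding named in the XML declaration,
# or UTF-8, which XML assumes when nothing is declared and there is no BOM.
sub declared_encoding {
    my ($raw_xml) = @_;
    if ($raw_xml =~ /\A<\?xml[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return $2;
    }
    return 'UTF-8';
}

# Made-up sample: byte A0 is a non-breaking space in windows-1252.
my $raw_xml = qq{<?xml version="1.0" encoding="windows-1252"?>\n}
            . qq{<country>United\xA0Kingdom</country>\n};

my $name = declared_encoding($raw_xml);
find_encoding($name) or die "Unknown encoding: $name\n";

# Turn the raw bytes into Perl characters using the declared encoding.
my $xml = decode($name, $raw_xml);

print "Declared encoding: $name\n";
```

The database side needs the same care: the connection and the column must agree on an encoding, and how that is configured depends on the database and driver, which the thread does not name.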
    A Monk aims to give answers to those who have none, and to learn from those who know more.
Re: How to remove the rubbish from "United Kingdom"
by sundialsvc4 (Abbot) on Mar 28, 2013 at 15:45 UTC

    I would suggest that you examine the XML file using a hex-editor so that you can see for yourself what the bytes-and-bits contain. There might be a Unicode string. Or, in some cases, there might be a &metacharacter; in the actual data. But a hex-editor will avoid any attempts by any program to “decode” what the data actually is. You will see it plain.
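
If no hex-editor is at hand, a few lines of Perl can do the same job. This is a minimal sketch; the `hex_dump` function is a made-up helper, and the script expects the file path as its first command-line argument.

```perl
use strict;
use warnings;

# Format a byte string as lines of: offset, up to 16 hex bytes, printable ASCII.
sub hex_dump {
    my ($bytes) = @_;
    my $out = '';
    for (my $offset = 0; $offset < length($bytes); $offset += 16) {
        my $chunk = substr($bytes, $offset, 16);
        my $hex   = join ' ', map { sprintf '%02x', ord($_) } split //, $chunk;
        (my $ascii = $chunk) =~ s/[^\x20-\x7e]/./g;   # mask non-printable bytes
        $out .= sprintf "%08x  %-47s  %s\n", $offset, $hex, $ascii;
    }
    return $out;
}

# Read the file named on the command line as raw bytes, with no decoding layer.
if (@ARGV) {
    my $path = shift @ARGV;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!\n";
    local $/;    # slurp mode: read the whole file at once
    my $bytes = <$fh>;
    close $fh;
    print hex_dump($bytes);
}
```

Between "United" and "Kingdom", the bytes `c2 a0` would point to a UTF-8 non-breaking space, a lone `a0` to a single-byte encoding such as Latin-1 or windows-1252, and a literal `&nbsp;` or `&#160;` to an entity in the data.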

    I would say that the next step would be to do the same thing, now, with the database record. Most SQL databases have a function that will display the content of a text or BLOB field as binary. Once again, what is, byte-for-byte, in that database record?

    Now comes the fun part ... the SQL metadata for a column can include a character-set specification, which tells interested programs how the data should be interpreted (by those programs that actually listen to the metadata). Your terminal or shell program is actually the same way ... it, too, has to make guided assumptions about how to properly interpret those bytes. You need to be cautious that it is not misleading you into perceiving “a problem” that the consumer of the data won’t perceive.

    It is a good idea for stored XML data to explicitly include a character-set specification. Even if it is not stored in the record (relying instead on metadata), it’s a good idea to provide the <?xml ...?> declaration on any subsequent output or use of the data that is stored. XML is meant to be self-describing and needs that description.
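
A small sketch of that advice on the output side: the encoding named in the declaration should match the encoding layer on the filehandle. The element name and the in-memory buffer are assumptions made for illustration; a real script would open a file path, and text containing `<`, `>` or `&` would need escaping first.

```perl
use strict;
use warnings;

# Characters as held in the program after decoding (U+00A0 is a non-breaking space).
my $country = "United\x{00A0}Kingdom";

# ASSUMPTION: write to an in-memory buffer so the sketch is self-contained.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "Cannot open buffer: $!\n";

# The declaration states UTF-8, and the filehandle layer encodes as UTF-8.
print {$out} qq{<?xml version="1.0" encoding="UTF-8"?>\n};
print {$out} "<country>$country</country>\n";
close $out or die "Cannot close buffer: $!\n";

printf "Wrote %d bytes\n", length($buffer);
```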

    P.S. Removing the rubbish from the United Kingdom appears to be a political problem... ;-)
    It appears that someone took offense at this American attempt at humor.