http://www.perlmonks.org?node_id=852656


in reply to stripping characters from html

I am finding certain characters are breaking the script.

In what way are they breaking the script?  Maybe you just need to entity-encode those characters.  Preferably use numeric entities (encode_entities_numeric() from HTML::Entities), because, in contrast to HTML, XML predefines only very few named entities (i.e. ones that work without explicit entity declarations).   Does ∫ really cause an error?
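
A minimal sketch of the numeric-entity approach, assuming HTML::Entities is installed; the sample string is made up for illustration:

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities_numeric);

# Hypothetical sample string; \x{222B} is the integral sign.
my $text = "area = \x{222B} f(x) dx & more";

# With no second argument, the default "unsafe" set is encoded:
# non-ASCII characters, most control characters, and < > & " '
my $encoded = encode_entities_numeric($text);

print "$encoded\n";    # area = &#x222B; f(x) dx &#x26; more
```

Since the result is pure ASCII, it should survive regardless of which encoding the XML file declares.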

Alternatively, try specifying an appropriate encoding, i.e. the one the data is actually stored in, in the XML declaration on the first line of the file: <?xml version="1.0" encoding="..."?>.
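
As a rough illustration, assuming the data is to be written as UTF-8 (the encoding name and the sample element are my assumptions, not something from your script), the declaration and the actual byte encoding have to agree:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical document body; \x{222B} is the integral sign.
my $body = "<formula>\x{222B} f(x) dx</formula>";

# The declared encoding must match how the bytes are really written.
my $encoding = 'UTF-8';
my $xml   = qq{<?xml version="1.0" encoding="$encoding"?>\n$body\n};
my $bytes = encode($encoding, $xml);

binmode STDOUT;    # emit the already-encoded bytes unchanged
print $bytes;
```

If the declaration says one thing and the bytes are something else (e.g. Latin-1 data under a UTF-8 declaration), a conforming parser is likely to reject the file.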

Or, as a last resort, simply strip everything outside of the ASCII range.
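
Something along these lines would do the stripping (a sketch with a made-up sample string; note that it is lossy, the characters are simply gone afterwards):

```perl
use strict;
use warnings;

# Hypothetical sample string with two non-ASCII characters.
my $text = "caf\x{E9} \x{222B} f(x) dx";

# Copy the string, then delete everything outside 0x00-0x7F.
(my $ascii = $text) =~ s/[^\x00-\x7F]//g;

print "$ascii\n";
```

The same can be written as tr/\x00-\x7F//cd, which deletes the complement of the ASCII range.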