in reply to stripping characters from html
I am finding certain characters are breaking the script.
In what way are they breaking the script? Maybe you just need to entity-encode those characters. Preferably use numeric entities (encode_entities_numeric() from HTML::Entities): in contrast to HTML, XML predefines only very few named entities (&lt;, &gt;, &amp;, &quot; and &apos;), and any other named entity only works with an explicit entity declaration. Does ∫ really cause an error?
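A minimal sketch of the entity-encoding route, assuming the CPAN module HTML::Entities is installed and that the text is already a decoded character string. The helper name xml_safe and the sample string are made up for illustration only.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities_numeric);   # not exported by default

# Hypothetical helper: replace markup-significant and non-ASCII characters
# with numeric character references, which XML accepts without any
# entity declarations.
sub xml_safe {
    my ($text) = @_;
    return encode_entities_numeric($text);
}

# \x{222B} is the integral sign; the ampersand is encoded as well.
my $encoded = xml_safe("\x{222B} f(x) dx & more");
print "$encoded\n";
```

With the default set of "unsafe" characters, the result is plain ASCII, so it should survive whatever encoding the XML file ends up being written in.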
Alternatively, try specifying an appropriate encoding in the XML declaration on the first line of the XML file (<?xml version="1.0" encoding="..."?>). It has to match the encoding the file is actually written in.
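If you go the declaration route, here is a rough sketch of keeping the declared encoding and the actual bytes in sync, using the core Encode module. UTF-8 is just my assumption for the example, and xml_document_bytes is a made-up helper name.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: prepend an XML declaration and encode the whole
# document with the same encoding the declaration names.
sub xml_document_bytes {
    my ($body) = @_;    # decoded character string holding the XML markup
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n} . $body;
    return encode('UTF-8', $xml);
}

my $bytes = xml_document_bytes("<formula>\x{222B} f(x) dx</formula>\n");

binmode STDOUT;         # the data is already encoded, so write raw bytes
print $bytes;
```

The point is only that the encode() call and the encoding="..." attribute name the same encoding; a parser reading the file then decodes the integral sign correctly instead of choking on it.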
Or, as a last resort, simply strip everything outside of the ASCII range.
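The last-resort variant could look something like this (core Perl only; strip_non_ascii is a name I made up for the example):

```perl
use strict;
use warnings;

# Hypothetical helper: delete every character outside the ASCII range
# (code points above 0x7F). This loses information, hence "last resort".
sub strip_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7F]+//g;
    return $text;
}

print strip_non_ascii("caf\x{E9} \x{222B} dx"), "\n";
```

This works on decoded character strings as well as on raw UTF-8 bytes, since every byte of a multi-byte UTF-8 sequence is above 0x7F.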