Invalid XML characters

by Anonymous Monk
on Mar 23, 2009 at 08:18 UTC ( #752527=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to find a Perl package (or a regex) to remove invalid characters from an input XML file. The file contains characters like &eacute;, &acirc;, etc., which cause parsing of the XML file to fail.

Replies are listed 'Best First'.
Re: Invalid XML characters
by mirod (Canon) on Mar 23, 2009 at 08:52 UTC

    Are you sure that's what you want to do? I would think that you would rather replace the entities (that's what they are called) with their Unicode or Latin-1 characters.

    That can be done by adding a DTD to the file that contains the proper declarations for the entities. The entity declaration file you want is probably one referenced by the XHTML spec.

    This way you don't have to rely on brittle regexps to do the job; the parser will do it for you. If your input is really HTML, which follows SGML syntax, then entities can be trickier to match than you might think: the final semicolon is optional in certain cases, for example. Post a follow-up if you need to know how to add a proper DTD to your XML (with an extract of the beginning of the file you have).
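
A minimal sketch of the DTD approach described above, using only core Perl to build the declarations. The helper name `add_entity_declarations`, the `%ENTITY` table, and the sample document are hypothetical; a real file would need the full entity set (for example the Latin-1 set referenced by the XHTML spec) rather than these two names. It also assumes the input has no DOCTYPE of its own yet.

```perl
use strict;
use warnings;

# Hypothetical table: entity name => Unicode code point.
# Only the two entities from the question are listed here.
my %ENTITY = (
    eacute => 233,    # U+00E9
    acirc  => 226,    # U+00E2
);

# Build an internal DTD subset declaring each entity and put it in front
# of the root element, so the XML parser can expand the entities itself.
sub add_entity_declarations {
    my ($xml, $root) = @_;

    my $decls = join '',
        map { sprintf qq{  <!ENTITY %s "&#%d;">\n}, $_, $ENTITY{$_} }
        sort keys %ENTITY;
    my $doctype = sprintf "<!DOCTYPE %s [\n%s]>\n", $root, $decls;

    # The XML declaration, if present, must stay first in the document.
    return $xml if $xml =~ s/\A(<\?xml.*?\?>\s*)/$1$doctype/s;
    return $doctype . $xml;
}

my $xml   = qq{<?xml version="1.0"?>\n<doc>caf&eacute; ch&acirc;teau</doc>\n};
my $fixed = add_entity_declarations($xml, 'doc');
print $fixed;

# If XML::Parser happens to be installed, check that the result now parses.
if (eval { require XML::Parser; 1 }) {
    XML::Parser->new->parse($fixed);
}
```

The name after `<!DOCTYPE` has to match the document's root element, which is why mirod asks to see the beginning of the file.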

      Thanks for your quick response. I did not know about the entity declaration files that you mentioned, which is why I was looking for a regex or package that would do the job.
      I am currently using XML::Smart for fetching and parsing the XML file. Can you please guide me further on using the entity declaration file with XML::Smart, so that it will parse the file without crashing?

        As I said, I need to look at a sample of your data, at least the beginning, to see where and how to put the reference to the entity declarations.

        Note that XML::Smart can parse XML in two different ways: either it uses XML::Parser, which is a real parser and will deal with entities and all the XML features, or it uses its own parser (XML::Smart::Parser), in which case you might be out of luck, according to the docs. If that's the situation you're in, I would consider either using another XML library, or, if you have to use XML::Smart, pre-processing your XML using xmllint or something similar. Invest a little time now and you will probably avoid annoying and hard-to-find bugs in the future.
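
One possible pre-processing step in the "something similar" category, sketched in core Perl: rewrite known named entities as numeric character references, which any XML parser accepts without a DTD. The `%CODEPOINT` table and the `numeric_refs` helper are hypothetical and deliberately tiny. This is still a regex pass, so it assumes well-formed XML where every reference ends in a semicolon, and it would also rewrite matching text inside CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical table of named entities to translate; extend as needed.
my %CODEPOINT = (
    eacute => 0xE9,
    acirc  => 0xE2,
    nbsp   => 0xA0,
);

# Replace &name; with &#NNN; when the name is in the table. Anything else,
# including the predefined XML entities such as &amp;, is left untouched.
sub numeric_refs {
    my ($xml) = @_;
    $xml =~ s{&([A-Za-z][A-Za-z0-9]*);}{
        exists $CODEPOINT{$1} ? sprintf('&#%d;', $CODEPOINT{$1}) : "&$1;"
    }ge;
    return $xml;
}

print numeric_refs('<doc>caf&eacute; &amp; ch&acirc;teau</doc>'), "\n";
# prints: <doc>caf&#233; &amp; ch&#226;teau</doc>
```

The output of such a pass can then be handed to whichever parser XML::Smart is using.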

Re: Invalid XML characters
by eff_i_g (Curate) on Mar 23, 2009 at 14:38 UTC
Re: Invalid XML characters
by Bloodnok (Vicar) on Mar 23, 2009 at 11:45 UTC
    Are you sure it's an XML file? Both &eacute; & &acirc; look like HTML markup...

    A user level that continues to overstate my experience :-))
      The two aren't mutually exclusive. HTML has an XML serialisation called XHTML.

Node Type: perlquestion [id://752527]
Approved by Corion