Are you sure that's what you want to do? I would think that you would rather replace the entities (that's what they are called) with the corresponding Unicode or Latin-1 characters.
That can be done by adding a DTD to the file that contains the proper declarations for the entities. The entity declaration file you want is probably one of those referenced by the XHTML spec.
This way you don't have to rely on brittle regexps to do the job; the parser will do it for you. If your input is really HTML, which follows SGML syntax, then entities can be trickier to match than you might think: the final semicolon is optional in certain cases, for example. Post a follow-up if you need to know how to add a proper DTD to your XML (with an extract of the beginning of the file you have).
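To illustrate the DTD approach described above, here is a minimal sketch. It is written in Python rather than Perl, purely for illustration; what matters is the document with its internal DTD subset, and an expat-based parser such as XML::Parser should treat such declarations the same way. The sample document and its two entity declarations are made up for the example; a real file would more likely pull in the XHTML entity sets.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: the internal DTD subset declares the named entities
# that the document body uses.
WITH_DTD = """<!DOCTYPE doc [
  <!ENTITY eacute "&#233;">
  <!ENTITY nbsp   "&#160;">
]>
<doc>caf&eacute;&nbsp;ok</doc>"""

# Same body without any declarations: the entity is unknown to the parser.
WITHOUT_DTD = "<doc>caf&eacute;&nbsp;ok</doc>"

# With the declarations in place, the parser expands the entities itself.
text = ET.fromstring(WITH_DTD).text
print(repr(text))

# Without them, parsing fails instead of silently passing the entity through.
try:
    ET.fromstring(WITHOUT_DTD)
    failed = False
except ET.ParseError as err:
    failed = True
    print("parse error:", err)
```

No regexp is involved: once the entities are declared, the text you get back already contains the real characters.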
Hi,
Thanks for your quick response. I didn't know about the entity declaration files that you mentioned, which is why I was looking for a regex or a package that would do the job.
I am currently using XML::Smart for fetching and parsing the XML file. Can you please guide me further on using the entity declaration file along with XML::Smart, so that it will parse the file without crashing?
Thanks
As I said, I need to look at a sample of your data, at least the beginning, to see where and how to put the reference to the entity declarations.
Note that XML::Smart can parse XML in two different ways: either it uses XML::Parser, which is a real parser and will deal with entities and all the other XML features, or it uses its own parser (XML::Smart::Parser), in which case you might be out of luck, according to the docs. If that's the situation you're in, I would consider either using another XML library or, if you have to use XML::Smart, pre-processing your XML using xmllint or something similar. Investing a little time now will probably save you from annoying and hard-to-find bugs in the future.
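As a rough idea of what such a pre-processing step could look like, here is a sketch, again in Python for illustration only. It is not xmllint and not part of XML::Smart: the helper name `expand_entities` is hypothetical, and it assumes the input has no XML declaration or DOCTYPE of its own. It declares the HTML named entities in an internal DTD subset, lets the parser expand them, and re-serialises the result so that a simpler parser downstream never sees a named entity.

```python
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML predefines must not be redeclared this way.
PREDEFINED = {"lt", "gt", "amp", "quot", "apos"}


def expand_entities(xml_text: str, root_name: str) -> str:
    """Return xml_text re-serialised with named entities expanded.

    Hypothetical helper: assumes xml_text has no XML declaration and no
    DOCTYPE, and that root_name is the name of its root element.
    """
    # One <!ENTITY ...> declaration per HTML named entity.
    decls = "".join(
        f'<!ENTITY {name} "&#{codepoint};">\n'
        for name, codepoint in sorted(name2codepoint.items())
        if name not in PREDEFINED
    )
    doctype = f"<!DOCTYPE {root_name} [\n{decls}]>\n"
    # The parser expands the declared entities while building the tree.
    root = ET.fromstring(doctype + xml_text)
    # Serialising again yields plain characters instead of entity references.
    return ET.tostring(root, encoding="unicode")


cleaned = expand_entities("<p>caf&eacute; &amp; cr&egrave;me&nbsp;!</p>", "p")
print(cleaned)
```

The output of a step like this can then be handed to whichever parser you are stuck with, since only the predefined XML entities remain in it.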
The two aren't mutually exclusive. HTML has an XML serialisation called XHTML.