XML::Parser provides ways to parse XML documents.
Why use XML::Parser
XML::Parser is the basis of most XML processing in Perl. Even if you don't plan to use it directly, you should at least know how to use it if you are working with XML.
That said, I think it is usually a good idea to have a look at the various modules that subclass XML::Parser, as they are usually easier to use.
Why NOT use XML::Parser
There are some compatibility problems between XML::Parser version 2.28 and higher and a lot of other modules, most notably XML::DOM. Plus, it seems to be doing some funky stuff with UTF-8 strings. Hence I would stick to version 2.27 at the moment.
Update: the ActiveState distribution currently includes XML::Parser 2.27.
Besides the modules already mentioned:
Things to know about XML::Parser
Characters are converted to UTF-8
XML::Parser will gladly parse Latin-1 (ISO-8859-1) documents, provided the XML declaration mentions that encoding. It converts all characters to UTF-8 though, so outputting Latin-1 is tricky. You will need to use Perl's Unicode functions, which have changed recently, so I will postpone detailed instructions until I catch up with them ;--(
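In the meantime, here is a minimal sketch of one way to get Latin-1 back out. It assumes Perl 5.8 or later (where the Encode module is in the core) and an XML::Parser recent enough to hand character strings to the handlers; the sample document and variable names are made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;
use Encode qw(encode);

# A Latin-1 document: \xE9 is the single byte for e-acute in ISO-8859-1.
my $xml = qq{<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\xE9</p>};

# Collect the character data; XML::Parser delivers it as UTF-8.
my $text = '';
my $parser = XML::Parser->new(
    Handlers => { Char => sub { my ($expat, $string) = @_; $text .= $string } },
);
$parser->parse($xml);

# Convert the string back to Latin-1 bytes before writing it out.
my $latin1 = encode('iso-8859-1', $text);
```

Characters that have no Latin-1 equivalent cannot be converted this way, so check the Encode documentation for how you want those handled.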
Parse errors are fatal
The XML recommendation mandates that when an error is found in the XML, the parser stop processing immediately. XML::Parser goes even further: it displays an error message and then dies.
To avoid dying, wrap the parse in an eval block and check $@ afterwards.
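A minimal sketch of the eval approach follows. The helper name try_parse and the sample document are hypothetical; the parser is created without handlers, so it only checks well-formedness.

```perl
use strict;
use warnings;
use XML::Parser;

# try_parse is a hypothetical helper: it returns (1, undef) on success
# and (0, $message) when the document is not well-formed.
sub try_parse {
    my ($xml) = @_;
    my $parser = XML::Parser->new;    # no handlers: only checks well-formedness
    my $ok = eval { $parser->parse($xml); 1 };    # the trailing 1 marks success
    return (1, undef) if $ok;
    (my $error = $@) =~ s/^\s+|\s+$//g;    # trim surrounding whitespace
    return (0, $error);
}

# <p> is never closed, so expat reports an error and parse dies.
my ($ok, $error) = try_parse('<doc><p>unclosed</doc>');
print $ok ? "well-formed\n" : "parse failed: $error\n";
```

The message left in $@ includes the line, column and byte offset where the error was found, which is usually enough to report it to the user and carry on.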
Getting all the character data
The Char handler can be called several times within a single text element. This happens when the text includes newlines or entities, or even seemingly at random, depending on expat's buffering mechanism. So the real content should be built by appending each string passed to Char to a buffer, and by using that buffer only in the End handler.
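The buffering idea can be sketched like this. The element name title, the sample document and the variable names are assumptions made for the example.

```perl
use strict;
use warnings;
use XML::Parser;

my @titles;    # complete text of every <title> element
my $buffer;    # defined only while we are inside a <title>

my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $tag) = @_;
            $buffer = '' if $tag eq 'title';    # start collecting
        },
        Char => sub {
            my ($expat, $string) = @_;
            # May be called several times for one piece of text: just append.
            $buffer .= $string if defined $buffer;
        },
        End => sub {
            my ($expat, $tag) = @_;
            if ($tag eq 'title') {
                push @titles, $buffer;    # the text is only complete here
                undef $buffer;
            }
        },
    },
);

$parser->parse("<doc><title>Tom &amp; Jerry\nreturn</title></doc>");
```

However many times Char is called for this title, @titles ends up with a single string containing the whole text.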
Styles
Styles are handler bundles. Five styles are defined in XML::Parser (Subs, Tree, Objects, Stream and Debug); others can be created by users.
Subs
Each time an element starts, a sub by that name (in the package given by the Pkg option) is called with the same parameters that the Start handler gets called with.
Each time an element ends, a sub with that name appended with an underscore ("_") is called with the same parameters that the End handler gets called with.
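Here is a small sketch of the Subs style. The package name MySubs, the element names and the sample document are hypothetical; Style and Pkg are the XML::Parser options that select the style and the package the subs are looked up in.

```perl
use strict;
use warnings;
use XML::Parser;

package MySubs;    # hypothetical package holding one sub per element name

our @log;

# Called when <title> starts; same parameters as a Start handler.
sub title {
    my ($expat, $tag, %attr) = @_;
    push @log, "open $tag lang=$attr{lang}";
}

# Called when </title> is found; same parameters as an End handler.
sub title_ {
    my ($expat, $tag) = @_;
    push @log, "close $tag";
}

package main;

my $parser = XML::Parser->new(Style => 'Subs', Pkg => 'MySubs');
$parser->parse('<book><title lang="en">Perl</title></book>');

print "$_\n" for @MySubs::log;
```

Elements with no matching sub (book here) are skipped silently, so you only write subs for the elements you care about.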
Tree
Parse will return a parse tree for the document. Each node in the tree takes the form of a tag, content pair. Text nodes are represented with a pseudo-tag of "0" and the string that is their content. For elements, the content is an array reference. The first item in the array is a (possibly empty) hash reference containing attributes. The remainder of the array is a sequence of tag-content pairs representing the content of the element.
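To make the structure concrete, here is a sketch that parses a small document with the Tree style and walks the result. The helper text_of is hypothetical and only collects text; the sample document is made up for illustration.

```perl
use strict;
use warnings;
use XML::Parser;

my $parser = XML::Parser->new(Style => 'Tree');
my $tree   = $parser->parse(
    '<foo><head id="a">Hello <em>there</em></head><bar>Howdy<ref/></bar>do</foo>'
);

# $tree is a tag, content pair for the root element:
#   [ 'foo', [ {}, 'head', [ { id => 'a' }, 0, 'Hello ', 'em', [ {}, 0, 'there' ] ],
#              'bar', [ {}, 0, 'Howdy', 'ref', [ {} ] ],
#              0, 'do' ] ]

# text_of (hypothetical helper) concatenates all the text found in an
# element's content array, recursing into child elements.
sub text_of {
    my ($content) = @_;
    my @kids = @$content;
    shift @kids;    # drop the attribute hash reference
    my $text = '';
    while (my ($tag, $value) = splice(@kids, 0, 2)) {
        $text .= $tag eq '0' ? $value : text_of($value);
    }
    return $text;
}

my ($root_tag, $root_content) = @$tree;
my $all_text = text_of($root_content);
```

Note how the empty ref element still gets a content array, holding just its (empty) attribute hash.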
Objects
This is similar to the Tree style, except that a hash object is created for each element. The corresponding object will be in the class whose name is created by appending "::" and the element name to the package name (the Pkg option). Non-markup text will be in the ::Characters class, with the string stored in the Text property. The contents of the corresponding object will be in an anonymous array that is the value of the Kids property for that object.
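A short sketch of what the Objects style gives you. The package name Doc and the sample document are hypothetical; attributes end up as keys of the element's hash, next to Kids.

```perl
use strict;
use warnings;
use XML::Parser;

# 'Doc' is a hypothetical package name, used as the prefix of every class.
my $parser = XML::Parser->new(Style => 'Objects', Pkg => 'Doc');
my $tree   = $parser->parse('<note lang="en">Hello <b>world</b></note>');

my $root = $tree->[0];    # parse returns an array ref holding the root object

my @summary;
push @summary, 'root class: ' . ref($root);
push @summary, "lang attribute: $root->{lang}";

# Children are in the Kids array: text objects or nested element objects.
for my $kid (@{ $root->{Kids} }) {
    if (ref($kid) eq 'Doc::Characters') {
        push @summary, "text: $kid->{Text}";
    }
    else {
        push @summary, 'element: ' . ref($kid);
    }
}

print "$_\n" for @summary;
```

Since each element is blessed into its own class, you can also define methods in those packages (Doc::note, Doc::b, ...) and call them on the objects.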
Stream
If none of the subs that this style looks for is there, then the effect of parsing with this style is to print a canonical copy of the document without comments or declarations. All the subs receive as their first parameter the Expat instance for the document they're parsing.
It looks for the following routines in the package given by the Pkg option: StartDocument, StartTag, EndTag, Text, PI and EndDocument.
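Here is a sketch using three of the Stream routines (StartTag, EndTag and Text, as named in the XML::Parser documentation). The package name MyStream and the sample document are hypothetical. StartTag and EndTag get the element name as their second parameter; the accumulated text is passed in $_, and the attributes of a start tag in %_.

```perl
use strict;
use warnings;
use XML::Parser;

package MyStream;    # hypothetical package holding the Stream routines

our @events;

sub StartTag {
    my ($expat, $type) = @_;
    # %_ holds the attributes of the element being opened.
    push @events, exists $_{id} ? "start:$type id=$_{id}" : "start:$type";
}

sub EndTag {
    my ($expat, $type) = @_;
    push @events, "end:$type";
}

sub Text {
    my ($expat) = @_;
    push @events, "text:$_";    # $_ holds the accumulated text
}

package main;

my $parser = XML::Parser->new(Style => 'Stream', Pkg => 'MyStream');
$parser->parse('<doc><p id="a">Hello</p></doc>');

print "$_\n" for @MyStream::events;
```

Any routine you leave out falls back to the default behaviour for that kind of event, which is to print it.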
Debug
This style just prints out the document in outline form.