http://www.perlmonks.org?node_id=368121

moleary has asked for the wisdom of the Perl Monks concerning the following question:

I need to write a script that parses several XML files, collecting information from them to write to an output file. I have used XML::Parser to do this kind of thing before, but only on files that have a !DOCTYPE line at the top that indicates a DTD file for the XML file being parsed. The XML files I am working with now do not have !DOCTYPE lines. It apparently will not necessarily be possible for users of my script to edit the XML files and insert a !DOCTYPE line in them, so I have to use the XML files as they are.

Now, when I run the first draft of my script on one of these files, it returns an error when it hits a line like:

    <restriction text="Cannot be used worldwide for Art d&eacute;cor" />

The entity declarations in a DTD file would indicate how &eacute; should be interpreted, and I am wondering what I can do to parse tags that contain entities like this when a DTD file is not available. I have tried various things with the handlers and other arguments of XML::Parser, but I haven't found the right configuration to make it work. Is there something that I have simply overlooked, or do I need to use a different approach?

Replies are listed 'Best First'.
Re: Using XML::Parser on a file without a !DOCTYPE line
by mojotoad (Monsignor) on Jun 19, 2004 at 08:54 UTC
    Sounds like you need to install some new handlers for Doctype and DoctypeFin as described here. Can you fake it out with a doctype of your own devising which can handle such cases?

    Cheers,
    Matt

      In playing around with XML::Parser, I have found that the Doctype handler is called when the parser finds a !DOCTYPE line and the DoctypeFin handler is called when the parser is done processing the !DOCTYPE line. If you specify a DTD file in a !DOCTYPE line, the DoctypeFin handler is called once the parser has finished parsing the DTD file. But the XML files I need to work with do not have !DOCTYPE lines, so these handlers would not be called. After reading your reply and the other reply, I am wondering whether I can make two calls to the parser: one in which I pass it a string that consists of a !DOCTYPE line naming the appropriate DTD file, and a second in which I pass it the XML file. Will the parser remember the declarations in the DTD file that it reads in the first call when it is parsing the XML file's contents in the second call?
Re: Using XML::Parser on a file without a !DOCTYPE line
by matija (Priest) on Jun 19, 2004 at 06:10 UTC
    If the proper DOCTYPE were to be inserted, would it be the same doctype for all the documents? If so, you could append that line yourself, before you give it to XML::Parser to parse.
      There is a separate DTD file for each of the XML files I need to read, but I do know which DTD file goes with which XML file, so that could work. What do you mean by appending the line before giving the file to the parser? Do you mean to modify the XML file in some way, or to feed the !DOCTYPE line to the parser first and then give the XML file to the parser, or something else? Would it work to call the parser's function that takes a string with the !DOCTYPE line and then call the parser's function that takes a file handle with the XML file? Would the parser apply the DTD file declarations to the XML file if I made these two calls?
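      One reading of the suggestion above is to add the DOCTYPE to the XML text in memory and hand the combined string to a single parse call, leaving the file on disk untouched. The following is only a sketch under stated assumptions: add_doctype is a hypothetical helper, the root element name doc is made up, and the internal subset declares just the one entity from the sample line. As far as I know, each call to parse starts a fresh document, so declarations fed in a separate earlier call would not carry over.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: put a DOCTYPE in front of the document body.
# If the text starts with an XML declaration, the DOCTYPE goes right
# after it, because the declaration must stay at the very start.
sub add_doctype {
    my ($xml, $doctype) = @_;
    return $xml if $xml =~ s/\A(<\?xml\s.*?\?>)/$1\n$doctype/s;
    return $doctype . $xml;
}

# Assumed DOCTYPE: an internal subset declaring only &eacute;.
my $doctype = <<'DTD';
<!DOCTYPE doc [
  <!ENTITY eacute "&#233;">
]>
DTD

# Sample input without a DOCTYPE; the <doc> wrapper is an assumption.
my $xml = <<'XML';
<?xml version="1.0"?>
<doc>
  <restriction text="Cannot be used worldwide for Art d&eacute;cor" />
</doc>
XML

# Collect the text attribute of every <restriction> element.
my @texts;
my $parser = XML::Parser->new(
    Handlers => {
        Start => sub {
            my ($expat, $element, %attr) = @_;
            push @texts, $attr{text} if $element eq 'restriction';
        },
    },
);
$parser->parse(add_doctype($xml, $doctype));

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @texts;
```

      To point at a separate DTD file per document instead, the inserted line could be of the form <!DOCTYPE doc SYSTEM "..."> with the matching file name. The XML::Parser documentation says that a true ParseParamEnt allows the external DTD to be read, unless the XML declaration sets standalone to "yes".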
Re: Using XML::Parser on a file without a !DOCTYPE line
by mojotoad (Monsignor) on Jun 22, 2004 at 17:26 UTC
    This question kept nagging at me. I took another look and found the following parameter that can be passed to your XML::Parser constructor:
    my $p = XML::Parser->new( ParseParamEnt => 0, );
    Also it comes in handy to do your own error handling when dealing with crufty XML -- otherwise the parser will fatally croak:
    my $p = XML::Parser->new( ErrorContext => 2, ParseParamEnt => 0, );
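    The error handling mentioned above can be sketched by wrapping the parse call in eval, so that a failure comes back as a message instead of ending the script. This is a minimal sketch, not a definitive recipe: try_parse is a hypothetical helper, and the input is the sample line from the question with no DOCTYPE, which is expected to fail.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical helper: returns (1, undef) on success and
# (0, $message) when the parser dies.
sub try_parse {
    my ($xml) = @_;
    # ErrorContext => 2 adds two lines of context on either side of
    # the offending line to the error message.
    my $parser = XML::Parser->new(ErrorContext => 2);
    my $ok = eval { $parser->parse($xml); 1 };
    return $ok ? (1, undef) : (0, $@);
}

# No DOCTYPE here, so &eacute; is not declared and parsing should fail.
my ($ok, $err) = try_parse('<restriction text="Art d&eacute;cor" />');
print $ok ? "parsed\n" : "parse failed: $err";
```

    Returning the message lets a script that loops over several files log the failure and move on to the next file.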

    FYI, ParseParamEnt appears to only apply to Expat under the hood.

    Cheers,
    Matt