Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.

by Moron (Curate)
on Nov 16, 2006 at 18:53 UTC ( #584569=perlquestion )
Moron has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, A utility dumps the transaction log of a time-series DBMS (hereafter TSDBMS) in XML format, or so the vendor's manual says. They provide what they call a DTD of this format (I know BNF but had never seen a DTD before). The output sort of resembles XML as I know it but, hmmm, not quite. I have a limited budget (say 8 hours) to create a parser for this output so I can read it into another program which will cross-reference it with information from elsewhere; that part is not a problem.

The bit that is giving me grief is that the 'DTD' below has to be used to create the rules for parsing the XMLish output, but it looks horribly like it's written in the language it is actually trying to describe, i.e. what you see below strongly resembles what it is trying to define. Normally I'd expect to have to write something this weird from scratch, but with a limit of 8 hours' development time available I do wonder if I can really dive into that rat-hole with a clear conscience, and no-one would blame me for looking for a module even if in vain. On the other hand, the circular-describing problem tends to make it look too perverse to expect to find a module. So it seems to me I can't avoid writing the parser (although not necessarily the lexer) from scratch. Or am I missing something? (so-called DTD of the so-called XML format follows...)

Update: I forgot to mention that irrespective of whether a module can help me, I will write this parser in Perl anyway - although the requirement is forebodingly weird, I have to ask just in case there is a module that can help.

<!ELEMENT transactions (trans*)>
<!ATTLIST transactions producer CDATA # REQUIRED timestamp CDATA # REQUIRED >
<!ELEMENT trans (file_trans | rdbms_trans | object_trans | proc_trans)>
<!ATTLIST trans state ('committed' | 'written')# REQUIRED originator CDATA# REQUIRED number CDATA# REQUIRED uid CDATA # REQUIRED timestamp CDATA# REQUIRED >
<!ELEMENT file_trans ( tree_id, symbol, trans_no, filerecord*, correctiorecord*, datarecord* )>
<!ELEMENT tree_id (#PCDATA)>
<!ELEMENT trans_no (#PCDATA)>
<!ELEMENT filerecord (fileheader*)>
<!ATTLIST filerecord type ('ADD'|'INSERT'|'UPDATE'|'DELETE')# REQUIRED >
<!ELEMENT fileheader (fileattr*)>
<!ATTLIST fileheader headerstate ('new'|'old')# REQUIRED>
<!ELEMENT fileattr (value*)>
<!ATTLIST fileattr id CDATA# REQUIRED >
<!ATTLIST fileattr
<!ELEMENT correctionrecord (dataitem)>
<!ATTLIST correctionrecord type ('ADD'|'INSERT'|'UPDATE'|'DELETE')# REQUIRED >
<!ELEMENT datarecord (dataitem)>
<!ATTLIST datarecord type ('ADD'|'INSERT'|'UPDATE'|'DELETE')# REQUIRED >
<!ELEMENT dataitem (attrid, value)>
<!ELEMENT proc_trans (proc*)>
<!ELEMENT proc (#PCDATA)>
<!ATTLIST proc procname CDATA# REQUIRED >
<!ELEMENT rdbms_trans (rdbms_operation)>
<!ELEMENT rdbms_operation (sql, alt?, cell*)>
<!ELEMENT sql (#PCDATA)>
<!ELEMENT alt (#PCDATA)>
<!ELEMENT cell (#PCDATA)>
<!ATTLIST cell col CDATA# REQUIRED row CDATA# REQUIRED >
<!ELEMENT object_trans (object_operation*)>
<!ELEMENT object_operation (object_attribute*)>
<!ATTLIST object_operation
    type ( 'ADD' | 'REBUILD' | 'CHANGEFLD' | 'DELETE' | 'EXTEND' | 'TRIGGER' | 'INSERT' | 'RENAME' | 'TRUNCATE' | 'UPDATE' | 'UNKNOWN' )# REQUIRED
    continue ('yes' | 'no')# REQUIRED >
<!ELEMENT object_attribute ( ado | attr | ado_code | ado_attributes | ado_codes |
    list | list_entry | list_cont | curve | curve_entry | curve_cont |
    stat_attr | enum_value | stat_attr_enum | stat_attr_code | df_attr | df_attr_code |
    tree | calendar | holidaydef | holidays | interfacedef | parameterdef |
    interface_parameters | mapping | interface_mapping | formula | derived_list | derived_curve)*>
<!ATTLIST object_attribute attributestate ('new'|'old') # IMPLIED >

<!-- Object types -->
<!ELEMENT ado (symbol, longname, owner, group, permissions, template, parent)>
<!ELEMENT ado_code (source, attrnum, date, code, formulaid)>
<!ELEMENT ado_attributes (symbol,attr*)>
<!ELEMENT attr (attrid, date, value, status)>
<!ELEMENT ado_codes (symbol,ado_code*)>
<!ELEMENT list (listid, longname, owner, group, permissions, listtp)>
<!ELEMENT list_cont (listid, list_entry*)>
<!ELEMENT list_entry (key)>
<!ELEMENT curve (listid, longname, owner, group, permissions, daycountbasis, payment_freq, interpolation_method, curvetp)>
<!ELEMENT curve_cont (listid, curve_entry*)>
<!ELEMENT curve_entry (key, attrnum)>
<!ELEMENT stat_attr (attrid, longname, owner, group, permissions, datatype, multivalued, unique, optional, profiling, derived, formulaid)>
<!ELEMENT enum_value (key, value)>
<!ELEMENT stat_attr_enum (attrid, enum_value*)>
<!ELEMENT stat_attr_code (attrid, ado_code*)>
<!ELEMENT df_attr (attr_num, attrid, owner, group, permissions, datatype, length,width, precision, check, correct, rebase, formulaid)>
<!ELEMENT df_attr_code (attrid, ado_code*)>
<!ELEMENT tree (symbol, longname, owner, group, permissions, root_node, depth, replicate, formulaid)>
<!ELEMENT calendar (calnum, longname, owner, group, permissions, allowsaturday, allowsunday)>
<!ELEMENT holidaydef (date, holiday)>
<!ELEMENT holidays (longname, holidaydef*)>
<!ELEMENT interfacedef (key, interface, owner, group, permissions)>
<!ELEMENT parameterdef (parameter, value)>
<!ELEMENT interface_parameters (longname, parameterdef*)>
<!ELEMENT mapping (code, fromvalue, tovalue)>
<!ELEMENT interface_mapping (longname, mapping*)>
<!ELEMENT formula (formid, longname, owner, group, permissions, viewtype, dimension, value)>
<!ELEMENT derived_list (listid, longname, owner, group, permissions, listtp, formulaid)>
<!ELEMENT derived_curve (listid, longname, owner, group, permissions, daycountbasis, payment_freq, interpolation_method, curvetp, formulaid, recalc_time)>

<!-- Field types -->
<!ELEMENT symbol (#PCDATA)> <!ATTLIST symbol fldtag CDATA>
<!ELEMENT longname (#PCDATA)> <!ATTLIST longname fldtag CDATA>
<!ELEMENT owner (#PCDATA)> <!ATTLIST owner fldtag CDATA>
<!ELEMENT group (#PCDATA)> <!ATTLIST group fldtag CDATA>
<!ELEMENT permissions (#PCDATA)> <!ATTLIST permissions fldtag CDATA>
<!ELEMENT template (#PCDATA)> <!ATTLIST template fldtag CDATA>
<!ELEMENT parent (#PCDATA)> <!ATTLIST parent fldtag CDATA>
<!ELEMENT attrid (#PCDATA)> <!ATTLIST attrid fldtag CDATA>
<!ELEMENT date (#PCDATA)> <!ATTLIST date fldtag CDATA>
<!ELEMENT value (#PCDATA)> <!ATTLIST value fldtag CDATA>
<!ELEMENT status (#PCDATA)> <!ATTLIST status fldtag CDATA>
<!ELEMENT source (#PCDATA)> <!ATTLIST source fldtag CDATA>
<!ELEMENT attrnum (#PCDATA)> <!ATTLIST attrnum fldtag CDATA>
<!ELEMENT code (#PCDATA)> <!ATTLIST code fldtag CDATA>
<!ELEMENT formulaid (#PCDATA)> <!ATTLIST formulaid fldtag CDATA>
<!ELEMENT listid (#PCDATA)> <!ATTLIST listid fldtag CDATA>
<!ELEMENT listtp (#PCDATA)> <!ATTLIST listtp fldtag CDATA>
<!ELEMENT key (#PCDATA)> <!ATTLIST key fldtag CDATA>
<!ELEMENT daycountbasis (#PCDATA)> <!ATTLIST daycountbasis fldtag CDATA>
<!ELEMENT payment_freq (#PCDATA)> <!ATTLIST payment_freq fldtag CDATA>
<!ELEMENT interpolation_method (#PCDATA)> <!ATTLIST interpolation_method fldtag CDATA>
<!ELEMENT curvetp (#PCDATA)> <!ATTLIST curvetp fldtag CDATA>
<!ELEMENT datatype (#PCDATA)> <!ATTLIST datatype fldtag CDATA>
<!ELEMENT length (#PCDATA)> <!ATTLIST length fldtag CDATA>
<!ELEMENT width (#PCDATA)> <!ATTLIST width fldtag CDATA>
<!ELEMENT precision (#PCDATA)> <!ATTLIST precision fldtag CDATA>
<!ELEMENT rebase (#PCDATA)> <!ATTLIST rebase fldtag CDATA>
<!ELEMENT depth (#PCDATA)> <!ATTLIST depth fldtag CDATA>
<!ELEMENT calnum (#PCDATA)> <!ATTLIST calnum fldtag CDATA>
<!ELEMENT fromvalue (#PCDATA)> <!ATTLIST fromvalue fldtag CDATA>
<!ELEMENT tovalue (#PCDATA)> <!ATTLIST tovalue fldtag CDATA>
<!ELEMENT formid (#PCDATA)> <!ATTLIST formid fldtag CDATA>
<!ELEMENT viewtype (#PCDATA)> <!ATTLIST viewtype fldtag CDATA>
<!ELEMENT dimension (#PCDATA)> <!ATTLIST dimension fldtag CDATA>
<!ELEMENT recalc_time (#PCDATA)> <!ATTLIST recalc_time fldtag CDATA>
<!ELEMENT allow_saturday (true | '1')> <!ATTLIST allow_saturday fldtag CDATA>
<!ELEMENT allow_sunday (true | false)> <!ATTLIST allow_sunday fldtag CDATA>
<!ELEMENT multivalued (true | false)> <!ATTLIST multivalued fldtag CDATA>
<!ELEMENT replicate (true | false)> <!ATTLIST replicate fldtag CDATA>
<!ELEMENT check (true | false)> <!ATTLIST check fldtag CDATA>
<!ELEMENT correct (true | false)> <!ATTLIST correct fldtag CDATA>
<!ELEMENT unique (true | false)> <!ATTLIST unique fldtag CDATA>
<!ELEMENT optional (true | false)> <!ATTLIST optional fldtag CDATA>
<!ELEMENT profiling (true | false)> <!ATTLIST profiling fldtag CDATA>
<!ELEMENT derived (true | false)> <!ATTLIST derived fldtag CDATA>
<!ENTITY true CDATA "1">
<!ENTITY false CDATA "0">
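For readers wondering what "using the DTD to create the parsing rules" could amount to, here is a minimal sketch that pulls the `<!ELEMENT ...>` declarations out of a DTD string into a hash of content models. The helper name `element_models` is hypothetical, the approach is plain regex matching rather than a conforming DTD parser, and it assumes content models without nested parentheses, which is true of the element declarations posted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: map each element name to its raw content model.
# Regex-based sketch only; not a conforming DTD parser.
sub element_models {
    my ($dtd) = @_;
    my %model;
    while ( $dtd =~ /<!ELEMENT\s+(\w+)\s+\(([^>]*)\)([*+?]?)\s*>/g ) {
        my ( $name, $body, $repeat ) = ( $1, $2, $3 );
        $body =~ s/\s+//g;    # drop layout whitespace inside the model
        $model{$name} = { body => $body, repeat => $repeat };
    }
    return \%model;
}

# A few declarations copied from the DTD in the post.
my $dtd = <<'END_DTD';
<!ELEMENT transactions (trans*)>
<!ELEMENT trans (file_trans | rdbms_trans | object_trans | proc_trans)>
<!ELEMENT rdbms_operation (sql, alt?, cell*)>
<!ELEMENT sql (#PCDATA)>
END_DTD

my $models = element_models($dtd);
for my $name ( sort keys %$models ) {
    print "$name => ($models->{$name}{body})$models->{$name}{repeat}\n";
}
```

The resulting hash says which children each element may contain; it does not by itself parse the data, which is the point several replies below make.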


Free your mind


Replies are listed 'Best First'.
Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.
by GrandFather (Sage) on Nov 16, 2006 at 19:27 UTC

    See section "2.8 Prolog and Document Type Declaration" in the "Extensible Markup Language (XML) 1.0 (Second Edition)" specification.

    You might also like to follow on with Create XML from Schema. The modules XML::LibXML and SGML::DTDParse may help too. Note that I've not used these modules, they just came up when I Super Searched "XML DTD" in SoPW.

    Oh, and finding someone close by who knows XML will help.

    DWIM is Perl's answer to Gödel
      Grandfather, Thanks, it did not occur to me that DTD was an official acronym and so this helps me also to search CPAN for further choices in trying to string together the two parsers.


      Free your mind

        Grandfather, Thanks, it did not occur to me that DTD was an official acronym and so this helps me also to search CPAN for further choices in trying to string together the two parsers.

        (As a side note and mostly just for fun) so you never looked at the source for a web page? For this particular one I find:

        <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <!-- Took this out for IE6ites " +dtd" -->
Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.
by planetscape (Chancellor) on Nov 17, 2006 at 00:01 UTC

    To the best of my recollection, Altova's free XMLSpy had a utility that would allow you to take either a DTD or an XML Schema (tools exist to go from DTD to Schema) and generate a small sample XML file that corresponds to that DTD/Schema. Whether such would be sufficient to generate a parser for that particular XML file, I cannot say.

    However, you might have a larger problem. When I open your DTD in my installed version of XMLSpy, I get the following error:

    This file is not well-formed: % expected.

    IMHO, you probably need to address this issue (hopefully just an error of copy & paste) before addressing larger issues of parsing DTDs, Schemas, or XML.

    Since XML Schemas are themselves written in XML (and well-tested parsers already exist for XML), I would try (1) fixing the error described above; (2) converting the DTD to XSD; (3) using an existing XML parser on the result.

    I have included a few links (unchecked to see if they are still "up"), that may be of use:

    James Clark's trang

    a Java-based translator

    a Perl program that can translate DTDs to XSD, link 1

    a Perl program that can translate DTDs to XSD, link 2



      Planetscape, thanks for taking so much trouble to map out a strategy, I will certainly start with making sure I have a DTD parser and a version of the DTD that work together bug-free and see where that gets me along the road you describe. The vendor also just warned me of a particular bug with the data so it's going to be sort of fun even getting past step 1 of your plan, but I'll give it a go.


      Free your mind

Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.
by Fletch (Chancellor) on Nov 16, 2006 at 20:08 UTC

    You're a bit confuzzled about the purpose of a DTD. A DTD describes what a particular (X|SG)ML document will contain, but it's not required to generate a document conforming to that DTD. It's more for use by other tools to determine if a particular well-formed (X|SG)ML document (i.e. one which conforms to the syntax requirements) is a valid document (i.e. has the expected semantic contents and layout).
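The well-formed versus valid distinction can be made concrete with a small sketch. It is deliberately simplified and rests on stated assumptions: it only recognises plain `<tag>` and `</tag>` (no attributes, comments or empty-element tags), the `%allowed` table is hand-copied from two declarations in the posted DTD, and "valid" here only means "every child is one the parent allows", ignoring order and cardinality. The name `check` is hypothetical.

```perl
use strict;
use warnings;

# Allowed children, hand-copied from two declarations in the posted DTD.
my %allowed = (
    rdbms_trans     => { rdbms_operation => 1 },
    rdbms_operation => { sql => 1, alt => 1, cell => 1 },
);

# Returns ( $well_formed, $valid ).
# Well-formed: every tag is closed in the right order (pure syntax).
# Valid: additionally, each child is permitted by its parent's declaration.
sub check {
    my ($xml) = @_;
    my @open;
    my $valid = 1;
    while ( $xml =~ m{<(/?)(\w+)>}g ) {
        my ( $closing, $name ) = ( $1, $2 );
        if ($closing) {
            my $top = pop @open;
            return ( 0, 0 ) unless defined $top && $top eq $name;
        }
        else {
            my $parent = $open[-1];
            $valid = 0
                if defined $parent
                && exists $allowed{$parent}
                && !$allowed{$parent}{$name};
            push @open, $name;
        }
    }
    return ( 0, 0 ) if @open;    # something was never closed
    return ( 1, $valid );
}

for my $doc (
    '<rdbms_operation><sql>select 1</sql></rdbms_operation>',
    '<rdbms_operation><proc>x</proc></rdbms_operation>',
    '<rdbms_operation><sql>select 1</rdbms_operation>',
) {
    my ( $wf, $valid ) = check($doc);
    print "well-formed=$wf valid=$valid\n";
}
```

The second document nests correctly but uses a child the DTD does not list under `rdbms_operation`; the third fails on syntax alone, so the DTD never comes into it.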

    There are tools (SGML aware editors and the like) which can use a DTD to provide more intelligent editing (perhaps giving dropdown lists for attribute values, for example), but unless you're writing one of those you really just need to understand what the DTD is describing and generate your output according to its expectations.

Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.
by idsfa (Vicar) on Nov 17, 2006 at 06:11 UTC

    As has been noted, the DTD is not really important for trying to parse the input file. You might want to look at XML::Simple as the easiest way to turn that XML structure into a perl data hash.

    The intelligent reader will judge for himself. Without examining the facts fully and fairly, there is no way of knowing whether vox populi is really vox dei, or merely vox asinorum. — Cyrus H. Gordon
      Thanks and yes I'm tempted enough by this argument because it avoids the potential bug in the DTD and might just have the simplicity to overlook the vendor-reported bug in the XML itself, so I'll give it an early whirl.


      Free your mind

The case for an anti-lexer - Re: Parsing an XML-like definition of an XML-like language to create a parser of the actual data in that language.
by Moron (Curate) on Nov 27, 2006 at 13:49 UTC
    A note to say thanks for all the input. Planetscape's was the most inspiring because, as it turned out, XML::Simple wouldn't compile under version 5.005_003, so after trying to 'fix' it and finding the task too lengthy, I soon found myself seeking approval to spend the 8-hour budget writing a parser from scratch. Fortunately it was given, and the result, one working day at this client later, is shown below.

    A couple of notes first:

    1) this is only tested against the particular XML it needed to parse.

    2) I am about to modify it so that it has SAX-like support except that I need to hand-roll that as well for the same reasons. The unmodified version can be regarded as the subroutine GetTag shown below because the actual data is one huge tag. Apart from this being non-generic, I will want to process at one level lower than that and SAX-style looks the best kind of algorithm.

    3) Owing to the tag - body - terminator structure, I found it easier to write an antilexer than a normal one, i.e. it returns everything up to a choice of matched terminating patterns instead of matching the immediate content. This appears to have a number of advantages over a positive lexer:

    - no need for a token table

    - language independent

    - trivialises the lexer and thrower routines even more than usual with Perl

    An extra mechanism is a Step routine to step over the terminating expression which being logically known didn't usually need to be lexed specifically. This is no downside in terms of overall code length, however.
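The anti-lexer idea from note 3 can be sketched in isolation. This is not the code from the post: it works on a string reference instead of `$_` plus a file handle, and it finds the terminator with one alternation regex rather than stepping character by character. The names `take_until` and `step` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical anti-lexer: consume and return everything in $$buf that
# precedes the first match of any terminator pattern. The terminator itself
# is left in the buffer. Status is 1 if a terminator was found, else 0.
sub take_until {
    my ( $buf, @terminators ) = @_;
    my $alt = join '|', map { "(?:$_)" } @terminators;
    if ( $$buf =~ /$alt/ ) {
        # $-[0] is the offset where the match starts; cut off what precedes it.
        my $content = substr( $$buf, 0, $-[0], '' );
        return ( $content, 1 );
    }
    my $content = $$buf;
    $$buf = '';
    return ( $content, 0 );
}

# Step over $n characters whose identity is already known to the caller.
sub step {
    my ( $buf, $n ) = @_;
    return substr( $$buf, 0, $n, '' );
}

# Collect key="value" pairs up to the closing '>' of a tag.
my $input = 'state="committed" uid="42">';
my %attr;
while ( length $input && $input !~ /^>/ ) {
    my ($key) = take_until( \$input, '=' );
    step( \$input, 2 );              # step over ="
    my ($val) = take_until( \$input, '"' );
    step( \$input, 1 );              # step over the closing quote
    take_until( \$input, '\S' );     # throw away whitespace
    $attr{$key} = $val;
}
print "$_ = $attr{$_}\n" for sort keys %attr;
```

No token table appears anywhere: the caller only ever names what ends the thing it wants, and skipping whitespace is the same call with `\S` as the terminator.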

    Note also that this version takes a file handle as argument, but if that is omitted, it will parse $_, which it does anyway at lower levels.

    #!/usr/bin/perl -w
    #
    # @(#) 1.0
    #
    # Author: (Moron)
    #
    # Version 1.0 16 October 2006
    use strict;
    use locale;
    use Time::localtime;
    use lib $ENV{AC_PERLLIB};
    use Env;
    use Utilities;
    use IPC::Open3;
    use POSIX ":sys_wait_h";

    # ... main program logic omitted as being off-topic
    # ...

    sub GetTag {
        my $fh = shift;
        my $pastNoise = 0;
        my ( $tag, $sts, $cnt, $twixt );
        do { # walk past comment tags e.g. <?version ... >
            Throw( $fh ); # walk past whitespace and "\n"s
            /^\</ or XMLerror( 'Format' );
            Step(); # step over one char
            ( $tag, $sts ) = AntiLex( $fh, '\W' ); # collect data until \W
                                                   # and then walk there
            unless( $pastNoise = $tag ) {
                ( $cnt, $sts ) = AntiLex( $fh, '\>' );
                /^\>/ or XMLerror( 'Comment Unclosed By > ' );
                Step();
            }
        } until ( $pastNoise );
        Throw( $fh);
        my $assignments = {};
        ASSMNT: for ( my $assco = 0; !/^\>/; $assco++ ) {
            my $kwd;
            ( $kwd, $sts ) = AntiLex( $fh, '\W', );
            unless ( $kwd ) { # only valid way is no assignments
                ( $assco || !/^\>/ ) and XMLerror( 'Format' );
                last ASSMNT;
            }
            Throw( $fh );
            ( $cnt, $sts ) = AntiLex( $fh, '\=' );
            ( $cnt || !$sts ) and XMLerror( 'Format' );
            Step();
            Throw( $fh );
            my $val = '';
            # error-check for something before quotes
            ( $twixt, $sts ) = AntiLex( $fh, '\"' );
            Step();
            $twixt and XMLerror( 'Format' );
            do { # quotes loop
                ( $cnt, $sts ) = AntiLex( $fh, '\"', '\\\"' );
                $sts or XMLerror( 'Unclosed Quote' );
                $val .= $cnt;
                length() or $_ = <$fh>;
            } until ( /^\"/ ); # i.e. include \" as part of string
            Step();
            $assignments -> { $kwd } = $val;
            Throw( $fh );
            length() or XMLerror( 'Unexpected End Of XML' );
        }
        Step();
        Throw( $fh );
        # case of simple value for current tag ...
        my $simple = '';
        /^</ or ( $simple, $sts ) = AntiLex( $fh, '<' );
        my @subtags = (); # collect nested tags to current tag
        while ( !$simple && /^\<(.)/ && ($1 ne '/' ) ) {
            push @subtags, GetTag( $fh );
            Throw( $fh );
        }
        AntiLex( $fh, '<' ); # * ... see comment below
        if ( /^\<\/(\w+)\>(.*)/ ) {
            ( $1 eq $tag ) or XMLerror( 'Tag Nesting' );
            $_ = $2; # walk past closing tag.
            $simple and return { $tag => $simple };
            return { $tag => { ASSMNTS => $assignments, SUBTAGS => \@subtags } };
        }
        XMLerror( "Format" ); # anything okay between '*' and here was eliminated from suspicion.
    }

    # subroutine to walk past whitespace.
    sub Throw {
        return ( AntiLex( shift(), '\S' ) );
    }

    sub Step { # like chopping $_ but from the LEFT of the string
        s/^(.)//;
        return $1;
    }

    sub XMLerror {
        my $reason = shift;
        my @ct = split( "\n" );
        die "XML $reason Error: $ct[0]";
    }

    sub AntiLex {
        # - walk thru $_, reloading from optional fh if present, until
        #   matching one of a list of regexps
        # - eats the returned content from $_ ready for
        #   repeated calls to this routine by the calling parser
        #
        # to parse positively just give it negative regexps.
        # the purpose is to roll up a lexer and thrower into a trivial
        # piece of code.
        # - SYNOPSIS: ( $content, $status ) = AntiLex ( [fh], { pattern, ... } )
        my $fh = shift; # undef means simply: don't reload emptied $_ from file
        my $contents = '';
        while ( 1 ) {
            unless( defined() && length() ) {
                defined( $fh ) and $_ = <$fh>;
                $_ or return ( $contents, 0 );
                chomp;
            }
            for my $pat ( @_ ) {
                ( /^($pat)(.*)$/ ) and return ( $contents, 1 );
            }
            $contents .= Step();
        }
    }
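Note 2 mentions hand-rolled SAX-like support. As a rough illustration only, the sketch below walks the nested structure that `GetTag` returns (a single-key hash whose value is either a plain string or a hash with `ASSMNTS` and `SUBTAGS`) and fires start/text/end callbacks. It works on an already-built tree rather than streaming from the file, so it shows the callback style, not a true SAX parser. The names `walk` and `%handlers` and the sample tree are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical walker: fire SAX-like callbacks over a GetTag-style tree.
sub walk {
    my ( $node, $h ) = @_;
    my ($tag) = keys %$node;    # each node is a single-key hash
    my $body = $node->{$tag};
    if ( ref $body eq 'HASH' ) {
        # Compound tag: attributes in ASSMNTS, children in SUBTAGS.
        $h->{start}->( $tag, $body->{ASSMNTS} );
        walk( $_, $h ) for @{ $body->{SUBTAGS} };
    }
    else {
        # Simple tag: the value is the text content itself.
        $h->{start}->( $tag, {} );
        $h->{text}->($body);
    }
    $h->{end}->($tag);
}

my @events;
my %handlers = (
    start => sub { my ( $tag, $attrs ) = @_; push @events, "start $tag" },
    text  => sub { push @events, "text $_[0]" },
    end   => sub { push @events, "end $_[0]" },
);

# Hand-built sample in the shape GetTag returns.
my $tree = {
    rdbms_operation => {
        ASSMNTS => {},
        SUBTAGS => [ { sql => 'select 1' }, { cell => '42' } ],
    },
};

walk( $tree, \%handlers );
print "$_\n" for @events;
```

A streaming variant would call the same handlers from inside the parser as each tag is recognised, which avoids holding the whole "one huge tag" in memory.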


    Free your mind

Node Type: perlquestion [id://584569]
Approved by BaldPenguin