http://www.perlmonks.org?node_id=799295

periapt has asked for the wisdom of the Perl Monks concerning the following question:

This is a difficult question to ask since I'm not sure of the terminology. Basically I am looking for a solution to parse what I would call a "loose" XML grammar. This means that data is contained between nested tags just as XML but without the requirement to specify the sequence of subtags.

I'm a novice with regard to XML, but it seems that what I'm looking for is a more generalized grammar parser?

For example, this would be allowed:
<toptag>
  <subtag1>element #1</subtag1>
  <subtag2>element #2</subtag2>
  <subtag3>element #3</subtag3>
</toptag>
<toptag>
  <subtag1>element #3</subtag1>
  <subtag2>element #2</subtag2>
  <subtag1>element #1</subtag1>
</toptag>
<toptag>
  <subtag2>element #2</subtag2>
  <subtag2>element #2</subtag2>
  <subtag2>element #2</subtag2>
</toptag>

The trouble is that the subtags could occur in any order and in any number from 0 to unbounded.

Essentially, I want to build a hash of these tag elements and then parse through the hash to build an XML compliant output.

This is kind of out of my area and I'm not sure that I'm asking the right questions when I research this. Any suggestions would be appreciated.

Further clarification:
Maybe this will help clarify. Consider it this way. A person is writing a text document. They will tag various words or phrases of that document using a predefined set of tags. Different parts of the document may contain related tags. For example,

<statement>
This is the statement of <person id="001"><name>Joe Smith</name></person>.
His mother's name is <parent><name>Betty</name></parent>.
Joe is <person id="001"><age>15</age></person> years old.
</statement>


The person {name/age} sub-elements could occur in any order. In fact, the parent/person elements could occur in any order. There might also be multiple person tag sets.

Ultimately, I want to parse the final document, build a hash from the tags and then process the hash to combine all the elements associated with person id="001" into a single data structure.

Update:
I've received several good suggestions and some good advice. XML::Simple seems the most promising at the moment. Of course, I'm open to more suggestions and I'd love to hear from someone who has tackled this problem before.

Well, I've got some exploration to do ...

PJ
use strict; use warnings; use diagnostics;

Re: tagged text parser
by ikegami (Patriarch) on Oct 05, 2009 at 16:54 UTC
    What you describe as loose XML is just XML. I recommend XML::LibXML, although XML::Twig is great for transforming documents.
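
A minimal sketch of how the XML::LibXML suggestion might apply to the "combine everything for person id="001"" goal from the question. The `%person` layout (id, then field name, then a list of values) is an assumption made for illustration, not something the original poster specified.

```perl
use strict;
use warnings;
use XML::LibXML;

# Sample text from the question (line-wrap damage repaired).
my $xml = <<'XML';
<statement>
This is the statement of <person id="001"><name>Joe Smith</name></person>.
His mother's name is <parent><name>Betty</name></parent>.
Joe is <person id="001"><age>15</age></person> years old.
</statement>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Hypothetical structure: $person{$id}{$field} = [ values... ]
my %person;
for my $p ($doc->findnodes('//person')) {
    my $id = $p->getAttribute('id');
    next unless defined $id;

    # Child elements may appear in any order and any number.
    for my $child ($p->findnodes('*')) {
        push @{ $person{$id}{ $child->nodeName } }, $child->textContent;
    }
}

for my $id (sort keys %person) {
    for my $field (sort keys %{ $person{$id} }) {
        print "$id $field: @{ $person{$id}{$field} }\n";
    }
}
```

Because the lookup is by XPath rather than by position, the two separate `<person id="001">` fragments end up under the same key regardless of where they occur in the text.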
Re: tagged text parser
by bart (Canon) on Oct 05, 2009 at 17:13 UTC
    It's just plain XML.

    I'd recommend you try XML::Simple. It throws away some data, such as tag order, I think, so try it and see if it works for you. It's easy enough to use that this shouldn't take more than a few minutes.

    There are more elaborate (and less easy to use) XML modules, so if you don't like XML::Simple, come back and we can discuss those that are the best compromise between ease of use, and features.
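
A quick sketch of what trying XML::Simple on the first example might look like. The `<doc>` wrapper is an assumption added here, since XML needs a single root element and the sample in the question has several top-level `<toptag>` blocks. `ForceArray => 1` is used so that a subtag appearing once and a subtag appearing several times come back in the same shape.

```perl
use strict;
use warnings;
use XML::Simple qw(:strict);

# <doc> is a hypothetical root element wrapped around the question's sample.
my $xml = <<'XML';
<doc>
  <toptag>
    <subtag1>element #1</subtag1>
    <subtag2>element #2</subtag2>
    <subtag3>element #3</subtag3>
  </toptag>
  <toptag>
    <subtag2>element #2</subtag2>
    <subtag2>element #2</subtag2>
  </toptag>
</doc>
XML

# ForceArray: every element becomes an array ref, even if it occurs once.
# KeyAttr => []: do not fold elements into hashes keyed on an attribute.
my $data = XMLin($xml, ForceArray => 1, KeyAttr => []);

# Report how many times each subtag occurred in each <toptag>.
for my $top (@{ $data->{toptag} }) {
    for my $tag (sort keys %$top) {
        printf "%s x%d\n", $tag, scalar @{ $top->{$tag} };
    }
    print "--\n";
}
```

Two caveats worth checking against your own data: the relative order of differently named subtags is lost, and text interleaved with tags (as in the `<statement>` example) is mixed content, which XML::Simple is not designed to handle.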

      Hmmm,

      This has promise. Preliminary tests are positive. Thanks a lot for the suggestion. I wasn't even sure if I was describing the problem correctly. Most of the IT types I had discussed this with just sort of looked at me like I was speaking Martian.

      This may get me started on some proof of concept. Thanks

      PJ
Re: tagged text parser
by BioLion (Curate) on Oct 05, 2009 at 16:50 UTC

    I don't know much about XML, or whether there is a specific module for this kind of 'loose' parsing, but I suspect that what you want could be hacked from Text::Balanced and its extract_bracketed function?

    Hmmm... on second thoughts, maybe Parse::Gnaw would be a better avenue? Or Parse::RecDescent?

    There must be a better way!?!

    Update: SPAG

    Just a something something...
Re: tagged text parser
by roboticus (Chancellor) on Oct 06, 2009 at 13:56 UTC
    periapt:

    While it does appear to be plain XML, you'll want to be certain. What sorts of situations are going to be considered an error (e.g. tags out of order, mismatched tags, invalid characters, ...)? Are there any special cases you'll need to handle?

    The devil is in the details, as they say. So if you keep to the same rules and conventions as XML, things will be pretty simple using the suggestions you've already received. However, if you have to do any special case handling, your life can quickly become difficult.

    ...roboticus
      Wise words roboticus,

      For now I'm just working on a proof of concept: is this possible in a coherent, stable way? I expect that the finished product would hold to the rules and conventions of XML with regard to validity, but not strictly, since we've already determined that strict adherence (for example, enforcement through a schema) is unworkable.

      I'm hoping that by sticking to XML, I can reduce the amount of post processing (or edge cases) to a manageable level as I move farther along. However, I still have to flesh out the idea some more.

      Thanks

      PJ

        I want to underscore what roboticus said. If you use XML it can be straightforward and robust. If not, it will likely be, at least sometimes, hellish, and it will be difficult to explain to customers why it breaks randomly. Part of the point of XML is that if it's not well-formed, it's not XML. It should be considered unacceptable garbage.

        It is not difficult to write, read, parse, and validate XML against DTDs if you use something like XML::LibXML. If you don't intend to go that route, I would strongly recommend not using pretend XML. Use a different, real format that is perhaps easier to sling, like YAML or JSON. So you know, I'm not trying to be critical. I'm trying to save you (and those who will inherit your code base) pain. Using a fake version of a real data format is like writing your own custom format from scratch, except more confusing.
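
To make the DTD point concrete, here is a small sketch, assuming a hypothetical `<doc>` root and the three subtag names from the question. The content model `(subtag1 | subtag2 | subtag3)*` is one way a DTD can express "any of these, in any order, zero or more times", so a schema does not have to force a sequence. The `check` helper is made up for this example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical DTD: subtags in any order, any number (including none).
my $dtd = XML::LibXML::Dtd->parse_string(<<'DTD');
<!ELEMENT doc (toptag*)>
<!ELEMENT toptag (subtag1 | subtag2 | subtag3)*>
<!ELEMENT subtag1 (#PCDATA)>
<!ELEMENT subtag2 (#PCDATA)>
<!ELEMENT subtag3 (#PCDATA)>
DTD

# Classify a string as not well-formed, invalid, or valid.
sub check {
    my ($xml) = @_;

    # load_xml dies if the input is not well-formed XML.
    my $doc = eval { XML::LibXML->load_xml(string => $xml) }
        or return 'not well-formed';

    # is_valid returns a boolean instead of dying.
    return $doc->is_valid($dtd) ? 'valid' : 'invalid';
}

# Subtags out of order: still fine under this DTD.
print check('<doc><toptag><subtag2>b</subtag2><subtag1>a</subtag1></toptag></doc>'), "\n";
# Unknown tag: well-formed, but rejected by the DTD.
print check('<doc><toptag><other>x</other></toptag></doc>'), "\n";
# Mismatched tags: not XML at all.
print check('<doc><toptag><subtag1>a</toptag></doc>'), "\n";
```

The split between the two failure modes mirrors the distinction made earlier in the thread: mismatched tags mean the input is not XML, while an unexpected tag is a policy decision you can tighten or loosen in the DTD.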