I'm working on a Perl script that takes apart nasty legalese, stores it in a database, and reassembles it on request based on parameters.
So far, I'm good on everything except the parsing. The text is plain ASCII in the form:
(a)blahblahblahblah
(1)blahblahblahblah
(A)blahblahblahblah
(i)blahblahblahblah
Where each multi-line section is:
- Space-indented (but not in varying widths)
- Begins with an indicator in parens
and the indicator is in the progression of
a-z, each with possible "children" of 1-???, each with possible children of A-Z, each with possible children of (roman numerals).
The parser needs to be able to identify each section, as well as understand its parentage (e.g. b.2.C.iii would have to know that it was not only iii, but also a "child" of b.2.C).
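To make the parentage requirement concrete, here is a rough sketch of the bookkeeping I have in mind. It assumes the nesting depth of each indicator is already known (0 for a-z, 1 for numbers, 2 for A-Z, 3 for roman numerals), which is exactly the part that is hard. The name `enter_section` is made up for illustration and is not from any module.

```perl
use strict;
use warnings;

# Current position in the document: one marker per nesting depth.
my @stack;

# Hypothetical helper: record that a section with the given marker starts
# at the given depth, and return its full dotted path.
sub enter_section {
    my ($depth, $marker) = @_;
    die "depth $depth skips a level\n" if $depth > @stack;
    $#stack = $depth - 1;      # drop this depth and everything deeper
    push @stack, $marker;      # the new section becomes the current one
    return join '.', @stack;
}

print enter_section(0, 'b'),   "\n";   # b
print enter_section(1, '2'),   "\n";   # b.2
print enter_section(2, 'C'),   "\n";   # b.2.C
print enter_section(3, 'iii'), "\n";   # b.2.C.iii
print enter_section(1, '3'),   "\n";   # b.3
```

The dotted path doubles as a key for the database, and the parent of any section is just the same path with its last element removed.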
I wrote up a chunky little parser that does the deed, but I've run into complications:
- It appears that some text sections also have "lists", which are denoted by sections starting (N) where N is a decimal number. These lists shouldn't be pulled out, but the parser can't distinguish them from a subsection if they fall in the wrong spot.
- I currently "fudge" roman numeral i (to distinguish it from the letter "i"), and I'm worried that as soon as my parser hits the text, it will break.
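One idea for both problems is to use context: given the current path, an indicator is only accepted if it is the first child of the current section or the next sibling of the current section or one of its ancestors. The following is a minimal sketch of that idea, not a tested solution. The names `next_marker`, `resolve`, and `show` are hypothetical, the roman numeral table stops at xx, and the "deepest interpretation wins" rule is an assumption.

```perl
use strict;
use warnings;

# Roman numerals i..xx are enough for a sketch; extend as needed.
my @ROMAN = qw(i ii iii iv v vi vii viii ix x
               xi xii xiii xiv xv xvi xvii xviii xix xx);

# First marker at each depth: a-z, numbers, A-Z, roman numerals.
my @FIRST = ('a', '1', 'A', 'i');

# Marker that would follow $marker at the given depth, or undef if none.
sub next_marker {
    my ($depth, $marker) = @_;
    return $marker + 1 if $depth == 1;
    if ($depth == 3) {
        for my $i (0 .. $#ROMAN - 1) {
            return $ROMAN[$i + 1] if $ROMAN[$i] eq $marker;
        }
        return;
    }
    return if $marker eq 'z' || $marker eq 'Z';
    return chr(ord($marker) + 1);
}

# Given the current path (array ref) and an indicator found in the text,
# return the new path, or undef if the indicator is not a legal
# continuation (for example, an item of an inline list).
sub resolve {
    my ($path, $marker) = @_;
    my $depth = @$path;    # depth a new child would occupy

    # Assumption: prefer the deepest reading, i.e. first child of the
    # current section.
    if ($depth <= 3 && $marker eq $FIRST[$depth]) {
        return [ @$path, $marker ];
    }

    # Otherwise try the next sibling of the current section, then of
    # each ancestor in turn.
    for (my $d = $depth - 1; $d >= 0; $d--) {
        my $next = next_marker($d, $path->[$d]);
        if (defined $next && $next eq $marker) {
            return [ @{$path}[0 .. $d - 1], $marker ];
        }
    }
    return;
}

sub show {
    my $p = resolve(@_);
    return defined $p ? join('.', @$p) : '(not a section)';
}

print show(['h'], 'i'), "\n";             # i          (letter, sibling of h)
print show(['a', '1', 'A'], 'i'), "\n";   # a.1.A.i    (roman numeral)
print show(['a', '3'], '1'), "\n";        # (not a section)
```

This still guesses in the truly ambiguous spots: a list starting with (1) directly under section (a) looks identical to subsection a.1, and (i) after h.1.A could be either h.1.A.i or section i. Indentation or look-ahead would be needed to settle those.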
As far as I can tell, the best way to deal with this is to use a real parser that will evaluate the entire text rather than considering each line as mostly distinct, as I do now. Is this a task for Parse::RecDescent? The documentation really seems to assume experience with parsers; does anyone have a good starting point? Has anyone done anything similar to this?