Parsing the Law

swiftone has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a Perl script that takes apart nasty legalese, stores it in a database, and reassembles it on request based on parameters.

So far, I'm good on everything except the parsing. The text is straight ASCII in the form:

 (a)blahblahblahblah
 (1)blahblahblahblah
 (A)blahblahblahblah
 (i)blahblahblahblah

Where each multi-line section is:

Space-indented (but not in varying widths)
Begins with an indicator in parens

and the indicator is in the progression of a-z, each with possible "children" of 1-???, each with possible children of A-Z, each with possible children of (roman numerals).

The parser needs to be able to identify each section, as well as understand its parentage (i.e., b.2.C.iii would have to know that it is not only iii, but also a "child" of b.2.C).
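As a rough illustration of the parentage requirement, here is a minimal Perl sketch that keeps a stack of ancestor markers and builds the dotted path for each section. It assumes the nesting level of each marker is already known, which is the hard part discussed in the rest of the thread; `section_paths` and its input format are hypothetical names chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: given [ $level, $marker ] pairs in document order
# (level 1 = a-z, 2 = numbers, 3 = A-Z, 4 = roman numerals), return the
# full dotted path of every section.
sub section_paths {
    my @sections = @_;
    my @stack;    # markers of the current section and its ancestors
    my @paths;
    for my $sec (@sections) {
        my ($level, $marker) = @$sec;
        # Drop everything at this depth or deeper, then add the new marker.
        splice(@stack, $level - 1) if @stack > $level - 1;
        push @stack, $marker;
        push @paths, join('.', @stack);
    }
    return @paths;
}

my @paths = section_paths([1, 'b'], [2, '2'], [3, 'C'], [4, 'iii'], [2, '3']);
print "$_\n" for @paths;
```

Each path carries its own parentage: dropping the last component of `b.2.C.iii` gives the parent `b.2.C`.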

I wrote up a chunky little parser that does the deed, but I've run into complications:

It appears that some text sections also have "lists", which are denoted by sections starting (N) where N is a decimal number. These lists shouldn't be pulled out, but the parser can't distinguish them from a subsection if they fall in the wrong spot.
I currently "fudge" roman numeral i (to distinguish it from the letter "i"), and I'm worried that as soon as my parser hits the text, it will break.

As far as I can tell, the best way to deal with this is to use a real parser that will evaluate the entire text rather than considering each line as mostly distinct, as I do now. Is this a task for Parse::RecDescent? The documentation really seems to assume experience with parsers. Does anyone have a good starting point? Has anyone done anything similar to this?


Re: Parsing the Law by merlyn (Sage) on May 30, 2001 at 01:05 UTC
"The Damian" has already solved this in Text::Autoformat. I suggest stealing ideas and/or code from there. -- Randal L. Schwartz, Perl hacker
Re: Parsing the Law by Masem (Monsignor) on May 30, 2001 at 00:46 UTC
This might be oversimplifying the situation, but it's the only reasonable approach I can think of:

Do two operations: split on `/\s\([a-zA-Z1-90]\)/` into @texts, and then do a `/\s\(([a-zA-Z1-90])\)/g` into @sections. Shift off the first element of @texts (it should be whatever comes before the first section indicator, which you suggest is null), and then for array element $i, `$texts[$i]` corresponds to `$sections[$i]`.

Now, you simply need to work out the tree structure for this. Create an array of subroutines that parse the appropriate section number at the given level. E.g.:

```perl
sub major_section {
    my $sec = shift;
    if ( $sec =~ /^([A-Z])$/ ) {
        # Parenthesize ord's arguments: written as `ord $1 - ord 'A' + 1`,
        # the first ord would take the whole expression as its argument.
        return ord($1) - ord('A') + 1;
    }
    else {
        return 0;
    }
}
```

For each element of @sections in turn, start at one level below the current one in this coderef array and see if it matches; if so, it's at that level, otherwise move backwards in the coderef array until you hit a match. If you don't hit a match, then you want to try for your list starter (which must begin with 1), and if it's a list, run the list until the next section changes.

The only problem is a case like the following:

```
A.
1.
2.
a.
1. list data
2. list data
3.
4.
```

Without more formal guidelines from the original format, you will not be able to determine where A.2.a's list stops and section A.3 begins. Also, this assumes that any other text within parentheses has whitespace and thus does not look like section headers.

Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
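A minimal sketch of the split-and-match step described in this reply, under a few assumptions: the marker class is widened to `[a-zA-Z0-9]+` so that multi-character markers such as (12) or (iii) also match, and the sample text is invented for illustration. As the reply warns, a parenthesized word in running text that happens to follow whitespace would be mistaken for a marker.

```perl
use strict;
use warnings;

# Invented sample in the poster's format: space-indented, marker in parens.
my $text = " (a)first part\n (1)child text\n more child text\n"
         . " (A)deeper\n (b)second part\n";

# Assumption: a marker is one or more letters or digits in parentheses.
my $id = qr/[a-zA-Z0-9]+/;

# Markers in document order (list-context match returns the captures).
my @sections = $text =~ /\s\(($id)\)/g;

# Text between markers; the first field is whatever precedes the first marker.
my @texts = split /\s\($id\)/, $text;
shift @texts;

# $texts[$i] now belongs to $sections[$i].
for my $i (0 .. $#sections) {
    (my $body = $texts[$i]) =~ s/\s+/ /g;    # flatten whitespace for display only
    print "($sections[$i]) $body\n";
}
```

Because the pattern only requires whitespace before the opening parenthesis, a marker that shares a line with the previous section's text is split off as well.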
Re: Re: Parsing the Law by swiftone (Curate) on May 30, 2001 at 01:10 UTC
"without more formal guidelines from the original format, you will not be able to determine where A.2.a's list stops and section A.3 begins."

Which is pretty much my problem... The original format is written for humans to read (well, politicians), so as far as I can tell there is no syntactical help. That is why I was hoping for an overarching parser to check for all errors on assumptions, but I can see that that isn't going to handle all cases. Back to the drawing board, I guess.
Re: Parsing the Law by cLive ;-) (Prior) on May 30, 2001 at 00:55 UTC
If you're doing it moduleless, why not:

read all "(marker) associated text" blocks into a hash
parse hash keys with rules to work out placement
create new node hash to show parentage

i.e.:

```perl
# use this in your rule tests for position
# (array refs: the markers of each level, in order)
my %structure = (
    1 => [qw(a b c d e f g h i etc...)],
    2 => [qw(1 2 3 4 5 6 7 8 9 10 11 12 13 etc...)],
    3 => [qw(A B C D E F G H I etc...)],
    4 => [qw(i ii iii iv v vi vii viii ix x xi etc...)],
);

my %legalese;   # hash to store text/markers
my $key = 2;    # key marker to keep order ('1' is tree root)

# read into hash (expects the whole text in $_)
# note: the final block has no following marker, so it is left in $_
while (s/\((\w+)\)(.*?)(\(\w+\))/$3/s) {
    $legalese{$key}{'list_id'} = $1;
    $legalese{$key}{'content'} = $2;
    $key++;
}

my %node = ( 1 => '' );   # node structure tree,
                          # initialised with single node
                          # to denote top of page

# now work on rules for tree depth
my $last_node;

# go through markers in order parsed in
for (sort { $a <=> $b } keys %legalese) {
    # too tired to try to create rule set :)
    # but set nodes as follows
    $node{$_}{'parent'} = 'whatever parent node is';
    # either '1', $last_node, or somewhere in between
    # using %structure as your guide
    $last_node = $_;
}
```

The above is in no way final code, but that's the sort of approach I'd take. Hope that's enough pointers.

By knowing what type the previous node was, you can create a valid rule for the next node by asking:

is it of the child type expected?
is it of the same type, incremented one?
is it of parent type, incremented one?
is it of grandparent type, incremented one?
if out of structure expected, append to previous node (i.e., embedded list)

etc, etc...

cLive ;-)
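The rule questions in this reply can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `place_marker` and `outline` are hypothetical, roman numerals are limited to a fixed list up to xv, and letters stop at z/Z. It tries "first child" before "next sibling" before "next of an ancestor", and treats anything else as an embedded list item. It does not resolve the ambiguous case raised elsewhere in the thread, where a list item happens to continue an enclosing section's numbering.

```perl
use strict;
use warnings;

# First marker expected at each level (1 = a-z, 2 = numbers, 3 = A-Z, 4 = roman).
my @first = ('a', '1', 'A', 'i');

# Assumption: fifteen roman numerals are enough for this sketch.
my @roman = qw(i ii iii iv v vi vii viii ix x xi xii xiii xiv xv);
my %roman_next = map { ( $roman[$_] => $roman[$_ + 1] ) } 0 .. $#roman - 1;

# One "successor" routine per level; each returns undef if there is none.
my @succ = (
    sub { $_[0] =~ /^[a-y]$/ ? chr(ord($_[0]) + 1) : undef },
    sub { $_[0] =~ /^\d+$/   ? $_[0] + 1            : undef },
    sub { $_[0] =~ /^[A-Y]$/ ? chr(ord($_[0]) + 1) : undef },
    sub { $roman_next{ $_[0] } },
);

# Return the level (1..4) at which $marker fits, given the current path of
# ancestor markers, or 0 if it fits nowhere (treat as an embedded list item).
sub place_marker {
    my ($marker, @path) = @_;
    my $depth = @path;
    # Is it the first child of the current section?
    return $depth + 1 if $depth < 4 && $marker eq $first[$depth];
    # Is it the successor of the current section, or of an ancestor?
    for (my $lvl = $depth; $lvl >= 1; $lvl--) {
        my $expected = $succ[$lvl - 1]->($path[$lvl - 1]);
        return $lvl if defined $expected && $marker eq $expected;
    }
    return 0;
}

# Walk the markers in order, producing a dotted path for each section and
# "path[marker]" for anything treated as a list item.
sub outline {
    my (@path, @out);
    for my $marker (@_) {
        my $lvl = place_marker($marker, @path);
        if ($lvl) {
            splice @path, $lvl - 1;
            push @path, $marker;
            push @out, join('.', @path);
        }
        else {
            push @out, join('.', @path) . "[$marker]";
        }
    }
    return @out;
}

print "$_\n" for outline(qw(a 1 A i ii B 2 1 2 b));
```

Note that checking "first child" first is itself a guess: after a.2.H, a marker (i) is read as roman numeral i rather than as section i following section h.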
Re: Re: Parsing the Law by swiftone (Curate) on May 30, 2001 at 01:00 UTC
That is pretty much exactly what I have so far (except that I parsed filter-style, rather than reading it all in at once). However, it utterly fails to handle lists, as it sees them as sections... By reading them all in as you have done, I could conceivably backtrack and try again if assuming it was a section led to an error, but if this is better done with a "Real" parser I don't want to reinvent the wheel. However, since I have almost no "real" parser experience, I don't know if this is an appropriate situation for one or not.
Re: Parsing the Law by swiftone (Curate) on May 30, 2001 at 00:11 UTC
One more complication I neglected to mention: sometimes the first subsection is placed on the same line:

 (A)blahblahblah (i)blahblahblah
 (ii)blahblahblah
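For a line-oriented (filter-style) parser, one possible way to cope with this is a pre-processing pass that moves any marker following other text onto a line of its own. The sketch below is hypothetical: `split_inline_markers` is an invented name, and the pattern assumes that section text follows the closing parenthesis directly, as in the samples above, which keeps parenthesized words in running text from being split off.

```perl
use strict;
use warnings;

# Hypothetical pre-processing step: put a marker that follows other text on
# the same line onto its own space-indented line.
sub split_inline_markers {
    my ($text) = @_;
    # Whitespace preceded by text and followed by "(marker)" plus a
    # non-space character becomes a newline and a single leading space.
    $text =~ s/(?<=\S)[ \t]+(?=\([a-zA-Z0-9]+\)\S)/\n /g;
    return $text;
}

my $in = " (A)blahblahblah (i)blahblahblah\n (ii)blahblahblah\n";
print split_inline_markers($in);
```

After this pass, every marker starts its own line, so the existing per-line logic can stay unchanged.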
Re: Parsing the Law by Anonymous Monk on May 30, 2001 at 01:20 UTC
How about using whitespace as your leader? Make it like Python(cough*cough)! Transform the whitespace into INDENT/DEDENT characters and parse that.

