PerlMonks  

Parsing the Law

by swiftone (Curate)
on May 30, 2001 at 00:02 UTC ( [id://84058]=perlquestion )

swiftone has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a Perl script that takes apart nasty legalese, stores it in a database, and reassembles it on request based on parameters.

So far, I'm good on everything except the parsing. The text is plain ASCII in the form:

 (a)blahblahblahblah
 (1)blahblahblahblah
 (A)blahblahblahblah
 (i)blahblahblahblah
Where each multi-line section is:
  • Space-indented (but not in varying widths)
  • Begins with an indicator in parens
and the indicator is in the progression of a-z, each with possible "children" of 1-???, each with possible children of A-Z, each with possible children of (roman numerals).

The parser needs to be able to identify each section, as well as understand its parentage (i.e. b.2.C.iii would have to know that it was not only iii, but also a "child" of b.2.C).

I wrote up a chunky little parser that does the deed, but I've run into complications:

  • It appears that some text sections also have "lists", which are denoted by sections starting (N) where N is a decimal number. These lists shouldn't be pulled out, but the parser can't distinguish them from a subsection if they fall in the wrong spot.
  • I currently "fudge" roman numeral i (to distinguish it from the letter "i"), and I'm worried that as soon as my parser hits the text, it will break.
As far as I can tell, the best way to deal with this is to use a real parser that will evaluate the entire text rather than considering each line as mostly distinct, as I do now. Is this a task for Parse::RecDescent? The documentation really seems to assume experience with parsers; does anyone have a good starting point? Has anyone done anything similar to this?

Replies are listed 'Best First'.
Re: Parsing the Law
by merlyn (Sage) on May 30, 2001 at 01:05 UTC
Re: Parsing the Law
by Masem (Monsignor) on May 30, 2001 at 00:46 UTC
    This might be oversimplifying the situation, but it's the only reasonable approach I can think of:

    Do two operations: split on /\s*\([a-zA-Z0-9]*\)/ into @texts, and then do a /\s*\(([a-zA-Z0-9]*)\)/g match into @sections. Shift off the first element of @texts (it should be whatever comes before the first section indicator, which you suggest is null), and then for array index $i, $texts[$i] corresponds to $sections[$i].
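A minimal sketch of the split-and-capture step described above, assuming the whole text is already in one scalar. The sample text is invented for illustration, and the character class uses `+` rather than `*` so that an empty pair of parentheses is not treated as a marker; both choices are assumptions, not part of the original suggestion.

```perl
use strict;
use warnings;

# Sample text invented for illustration; not taken from any real statute.
my $text = " (a)first section\n (1)child of a\n"
         . " (A)grandchild (i)same-line child\n (ii)next\n";

# Marker pattern: optional leading whitespace, then letters/digits in parens.
my $marker = qr/\s*\([a-zA-Z0-9]+\)/;

# Text between markers. The first field is whatever precedes the first
# marker (empty here), so it is shifted off.
my @texts = split $marker, $text;
shift @texts;

# The markers themselves, in order of appearance.
my @sections = $text =~ /\s*\(([a-zA-Z0-9]+)\)/g;

# Pair each marker with its text; split drops trailing empty fields,
# so a missing final text is treated as the empty string.
for my $i (0 .. $#sections) {
    (my $body = $texts[$i] // '') =~ s/\s+\z//;
    print "$sections[$i] => $body\n";
}
```

Because the pattern does not care where on a line a marker sits, a subsection that starts on the same line as its parent is split out like any other. The flip side is that any parenthesised word without whitespace in running text would also be taken for a marker.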

    Now, you simply need to work out the tree structure for this. Create an array of subroutines that parse the appropriate section number at the given level. Eg:

    sub major_section {
        my $sec = shift;
        if ( $sec =~ /^([A-Z])$/ ) {
            # ordinal of the letter: A => 1, B => 2, ...
            return ord($1) - ord('A') + 1;
        }
        else {
            return 0;
        }
    }
    For each element of @sections in turn, start one level below the current one in this coderef array and see if it matches; if so, it's at that level, otherwise move backwards in the coderef array until you hit a match. If you don't hit a match, then try your list starter (which must begin with 1) and, if it's a list, run the list until the next section change.
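The coderef-array walk described above might look roughly like the following. This is a sketch under assumptions: `find_level`, `@level_parsers` and the small roman-numeral table are hypothetical names introduced here, and the table stops at xii only to keep the example short.

```perl
use strict;
use warnings;

# Lookup for the few roman numerals this sketch recognises (an assumption;
# a real document may run higher).
my @roman = qw(i ii iii iv v vi vii viii ix x xi xii);
my %roman_value = map { $roman[$_] => $_ + 1 } 0 .. $#roman;

# One classifier per level, outermost first. Each returns the marker's
# ordinal at that level, or 0 if the marker cannot belong there.
my @level_parsers = (
    sub { $_[0] =~ /^([a-z])$/       ? ord($1) - ord('a') + 1 : 0 },  # (a)..(z)
    sub { $_[0] =~ /^([1-9][0-9]*)$/ ? $1                     : 0 },  # (1)..(n)
    sub { $_[0] =~ /^([A-Z])$/       ? ord($1) - ord('A') + 1 : 0 },  # (A)..(Z)
    sub { $roman_value{ $_[0] } // 0 },                               # (i), (ii), ...
);

# Hypothetical helper: given the current depth (0 = before any section),
# try the child level first, then walk back towards the outermost level.
# Returns (level, ordinal), or an empty list if no level accepts the marker.
sub find_level {
    my ($marker, $current_depth) = @_;
    my $start = $current_depth < @level_parsers ? $current_depth : $#level_parsers;
    for (my $i = $start; $i >= 0; $i--) {
        my $ordinal = $level_parsers[$i]->($marker);
        return ($i + 1, $ordinal) if $ordinal;
    }
    return;
}

# Usage: track the depth while walking a marker sequence.
my @trace;
my $depth = 0;
for my $marker (qw(a 1 A i ii B 2 b)) {
    my ($level, $ordinal) = find_level($marker, $depth);
    next unless $level;
    $depth = $level;
    push @trace, "$marker:$level";
}
print "@trace\n";
```

Trying the child level first is what lets "i" be read as a roman numeral directly under (A) but as a letter directly under (h). It does not check that the ordinal is the expected next one, so it cannot tell an embedded list item from a real subsection.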

    The only problem is a case like the following:

    A.
    1.
    2.
    a.
    1. list data
    2. list data
    3.
    4.
    Without more formal guidelines from the original format, you will not be able to determine where A.2.a's list stops and section A.3 begins. Also, this assumes that any other text within parentheses has whitespace and thus does not look like section headers.


    Dr. Michael K. Neylon - mneylon-pm@masemware.com || "You've left the lens cap of your mind on again, Pinky" - The Brain
      without more formal guidelines from the original format, you will not be able to determine where A.2.a's list stops and section A.3 begins.

      Which is pretty much my problem...the original format is written for humans to read (well, politicians), so as far as I can tell there is no syntactical help, which is why I was hoping for an overarching parser to check for all errors on assumptions, but I can see that that isn't going to handle all cases.

      Back to the drawing board I guess.

Re: Parsing the Law
by cLive ;-) (Prior) on May 30, 2001 at 00:55 UTC
    If you're doing it moduleless, why not:
    • read all "(marker) associated text" blocks into a hash
    • parse hash keys with rules to work out placement
    • create new node hash to show parentage

    ie

    # use this in your rule tests for position
    my %structure = (
        1 => [qw(a b c d e f g h i)],                   # etc...
        2 => [qw(1 2 3 4 5 6 7 8 9 10 11 12 13)],       # etc...
        3 => [qw(A B C D E F G H I)],                   # etc...
        4 => [qw(i ii iii iv v vi vii viii ix x xi)],   # etc...
    );

    my %legalese;   # hash to store text/markers
    my $key = 2;    # key marker to keep order ('1' is tree root)

    # read into hash; works on $_, which must already hold the whole text.
    # the pattern needs a following marker, so the last section stays in $_
    while (s/\((\w+)\)(.*?)(\(\w+\))/$3/s) {
        $legalese{$key}{'list_id'} = $1;
        $legalese{$key}{'content'} = $2;
        $key++;
    }

    my %node = ( 1 => '' );   # node structure tree
                              # initialise with single node
                              # to denote top of page

    # now work on rules for tree depth
    my $last_node;

    # go through markers in order parsed in
    for (sort { $a <=> $b } keys %legalese) {
        # too tired to try to create rule set :)
        # but set nodes as follows
        $node{$_}{'parent'} = 'whatever parent node is';
        # either '1', $last_node, or somewhere in between
        # using %structure as your guide
        $last_node = $_;
    }
    The above is in no way final code, but that's the sort of approach I'd take.

    Hope that's enough pointers. By knowing what type the previous node was, you can create a valid rule for the next node by asking:

    • is it of the child type expected?
    • is it of the same type, incremented one?
    • is it of parent type, incremented one?
    • is it of grandparent type, incremented one?
    • if out of structure expected, append to previous node (ie, embedded list)
    etc, etc...
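One possible reading of the rule list above, as a sketch rather than a finished rule set. `ordinal_at` and `place_marker` are hypothetical helpers, the path is kept as a list of ordinals (b.2.C.iii becomes 2.2.3.3), and the roman-numeral table is cut off at xii for brevity.

```perl
use strict;
use warnings;

my @roman = qw(i ii iii iv v vi vii viii ix x xi xii);
my %roman_value = map { $roman[$_] => $_ + 1 } 0 .. $#roman;

# Ordinal of $marker if read as a marker of the given level (1..4), else 0.
sub ordinal_at {
    my ($level, $marker) = @_;
    if ($level == 1) { return $marker =~ /^[a-z]$/       ? ord($marker) - ord('a') + 1 : 0 }
    if ($level == 2) { return $marker =~ /^[1-9][0-9]*$/ ? $marker                     : 0 }
    if ($level == 3) { return $marker =~ /^[A-Z]$/       ? ord($marker) - ord('A') + 1 : 0 }
    if ($level == 4) { return $roman_value{$marker} // 0 }
    return 0;
}

# $path holds the ordinals of the current section, outermost first.
# Returns the new path, or nothing if the marker fits none of the rules.
sub place_marker {
    my ($path, $marker) = @_;
    my $depth = @$path;

    # Is it of the child type expected (and the first of its kind)?
    if ($depth < 4 && ordinal_at($depth + 1, $marker) == 1) {
        return [ @$path, 1 ];
    }

    # Is it of the same, parent, grandparent, ... type, incremented by one?
    for (my $level = $depth; $level >= 1; $level--) {
        my $expected = $path->[ $level - 1 ] + 1;
        if (ordinal_at($level, $marker) == $expected) {
            return [ @{$path}[ 0 .. $level - 2 ], $expected ];
        }
    }

    return;   # out of structure: append to previous node (embedded list)
}

# Usage: the stray "7" fits no rule and is treated as list content.
my $path = [];
my @log;
for my $marker (qw(a 1 A i ii 7 B 2 b)) {
    my $next = place_marker($path, $marker);
    if ($next) { $path = $next; push @log, join('.', @$path) }
    else       { push @log, "list($marker)" }
}
print "$_\n" for @log;
```

Requiring the exact next ordinal catches list items that fall out of sequence, but a list that begins with (1) directly under a lettered section still looks like a first child, so the ambiguity discussed in the thread remains.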

    cLive ;-)

      That is pretty much exactly what I have so far (except that I parsed filter-style, rather than reading it all in at once). It fails to handle lists utterly however, as it sees them as sections...

      By reading them all in as you have done, I could conceivably backtrack and try again if assuming it was a section led to an error, but if this is better done with a "Real" parser I don't want to reinvent the wheel. However, since I have almost no "real" parser experience, I don't know if this is an appropriate situation for one or not.

Re: Parsing the Law
by swiftone (Curate) on May 30, 2001 at 00:11 UTC
    One more complication I neglected to mention: Sometimes the first subsection is placed on the same line:
     (A)blahblahblah (i)blahblahblah
     (ii)blahblahblah
    
Re: Parsing the Law
by Anonymous Monk on May 30, 2001 at 01:20 UTC
    How about using whitespace as your leader? Make it like Python (cough*cough)! Transform whitespace into INDENT/DEDENT tokens and parse that.
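A rough sketch of what the INDENT/DEDENT idea could look like, with a hypothetical `indent_tokens` helper. It only helps if nesting actually shows up as different indentation widths, which the original question suggests is not the case for this text, so treat it as an illustration of the technique rather than a fix.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: turns changes in leading-space width into
# INDENT / DEDENT tokens, in the manner of Python's lexer.
sub indent_tokens {
    my (@lines) = @_;
    my @stack = (0);    # widths of the currently open indentation levels
    my @tokens;
    for my $line (@lines) {
        next if $line !~ /\S/;                       # skip blank lines
        my ($lead, $rest) = $line =~ /^( *)(.*)$/;
        my $width = length $lead;
        if ($width > $stack[-1]) {                   # deeper than before
            push @stack, $width;
            push @tokens, 'INDENT';
        }
        while ($width < $stack[-1]) {                # back out one level at a time
            pop @stack;
            push @tokens, 'DEDENT';
        }
        push @tokens, "LINE($rest)";
    }
    push @tokens, ('DEDENT') x $#stack;              # close anything still open
    return @tokens;
}

print join(' ', indent_tokens('(a)x', '  (1)y', '    (A)z', '(b)w')), "\n";
```

A dedent to a width that was never opened is silently accepted here; a stricter version would report it as an error.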
