|laziness, impatience, and hubris|
Re: Make a Syntax checkerby pemungkah (Priest)
|on Feb 28, 2012 at 00:23 UTC||Need Help??|
Let me make a suggestion. It looks like that what you need is, for lack of a better term, a pragmatic parser.
Such a parser expects its input to be bad: missing stuff, extra crud, bad case...Unfortunately the standard parsing mechanisms and modules don't handle this well - in general they expect a set of tokens, easily-distinguished parts of the input, arranged in a specific order defined by a grammar. (Most standard parsing algorithms are big on the "input will be perfect and I will complain if not"; the standard parsing mechanisms handle errors badly.
In your case, you have the likelihood of both errors in your tokens (casing and quoting) and grammatical errors in the input (missing parens, misordered tokens, etc.). There are several approaches to this:
First, redefine the input language as something that is easier to parse. You have reasonable rules as to what's allowed for each property, so define the simplest possible input for it:
I'm not sure about #3, as I didn't quite get what values option1 is allowed to have, nor whether they have a different meaning than the value does. If so, you could simply make "3" be "3v" or "3o" instead. Notice that we have eliminated all the stuff that is hard for humans: no quoting, no paren matching, no case-sensitive stuff. This is also really easy to parse: read the line, split on /\s*:\s*/ (a colon with or without spaces around it on either side), and slap the pairs into a hash. You can then write a file in the expected, paren-filled syntax by substituting the values into a template. I've specifically added a keyword to denote that you wanted to have an empty value: "empty".
Since this strips off all the crud, it's easy to diagnose errors too: "You forgot item 2, which should be a number from 1 to 7." "You didn't supply a value for 3, which needs to be <whatever> or 'empty'". It's ultrasimplified, so the user should have little trouble writing it.
The over-engineered approach involves creating a parser for the original language by adapting one of the standard parsing algorithms to add enough backtracking to handle the most common errors. You'll need to make the tokenization smarter (I expected a quote but see a "1". Since I'm at the position where I'd expect a quoted string, I'll insert a quote and continue), and the grammar will actually need to "parse" the most common errors and note them for a final set of diagnostics. For example, I've just received an open paren; this might be the start of a nested list in this property, or the previous property might have a missing close paren. I'll assume this is a nested property for now, and mark the input token string here. Oh look! The next token is "property2", so my guess was wrong. I'll push the tokens back to the mark onto the input stream again, with a close paren in front of it, and then rewind the computation to that point.
You can use your existing files to train your parser by starting out assuming everything is perfect, and throwing errors that you'll add to the pragmatic checking and assumptions that the backtracking will make. Obviously this is stupid hard. However, you would learn boatloads in the process of writing the second approach; certainly a YAPC-talk worth. This, however, isn't pragmatic: it assumes you have huge time resources to throw at this problem.
Let's combine these strategies to come up with something that's the best of both. Define another grammar to be used with a standard parsing algorithm, but in this one, you use the same keyword names as you did in the simplified language. Now you'll run multiple passes over the input, cleaning up the input a little more each time.
Pass 1: remove all parens, colons, and quotes. If you find two quotes together, a quote and a close paren, or a quote and a newline, change that to the keyword "empty" (and make a note of that). Now write a parser for this language, which can actually be pretty strict -- very close to the language in option 1, in fact, since you've reduced the input variation a lot; validate that the data is good; then rewrite the file from a template using the values you got! If you're careful about your wording, you should be able to relate the values back to the original input (e.g., "for item 2, you're missing a close quote or close paren") without having to be ultra-precise or letting the user know that you stripped out all those parens and quotes and stuff. You might note during the first pass if you got any funny-looking tokens, and emit warnings ("missing a quote on line 5: '18.104.22.168<-- HERE; inserting and continuing") as appropriate.
I'm handwaving details, but it's really not all that bad. The other suggestions (particularly Marpa) should be quite helpful in implementing this option. You'll need to write a little cleanup loop, with the appropriate code to push cleanup notes onto an array, which you'll print after the first pass; a second pass which will take the tokenized, simplified input and hand it to one of the standard parsing algorithms; and a final bit of code that takes the parsed input, validates that the data we got is OK, and then rewrites the input file using Text::Template or Template::Toolkit to create a definitely-OK, syntactically-valid final file. Obviously, if either of the first two loops gets to a point where it can't continue (pass one gets no input or finds illegal characters, pass two can't parse what it gets), we don't rewrite the output file but flag it as needing human attention.
This should solve a good 80-90% of your problem, and show you the rest that it can't handle. Minor edit: duplicated word.