![]() |
|
The stupid question is the question not asked | |
PerlMonks |
A demanding parserby gmax (Abbot) |
on Jan 25, 2002 at 14:58 UTC ( #141468=perlmeditation: print w/replies, xml ) | Need Help?? |
For a few weeks, I have been working at a parser for PGN files (Don't get me wrong. I wasn't working on it 8 to 5, since I have to dedicate most of my time to my job. Just giving some thoughts to it from time to time). PGN files contain chess games, and the acronym stands for Portable Game Notation. There is a detailed Standard governing the rules to create such documents, therefore making life uneasy for the poor guy who wants to write a parser. I started doing this because I need a parser to enter formalized PGN data into a database. Inserting the data is straightforward (well, almost, but there are problems that are related to database theory. Beyond that point, it's child's play for DBI). What is demanding is parsing the file. There is a Chess::Pgn module in the CPAN, but it does not do anything but a basic reading, and in addition it is buggy (it misses the last game in each PGN file). I contacted the author some weeks ago, but no significant news has come up. Then, I was on my own. I have discussed the matter at length with hsmyers, and after some prototypes I completed a parser that meets all the requirements. I think it's time to explain the problem. Structure of a PGN fileThe following text is a PGN game with almost all the problems my parser can even meet.There are two parts. In the first one, there are very simple "tags", with a key and a value. Piece of cake. The second part is the game, which you can take as a whole and consider your job done, as the CPAN module does, or you can read the Standard and extract the information that I need to feed a database with. Chess games in a PGN are made of, among other things: - move numbers (digits plus dots : 1. 2. 3.); - the moves themselves (e4 e5 Nf3 Nc6 and so on); - braced comments {like this}; - comments to the end of the line, starting with a semicolon; - escaped lines, starting with a '%'; - Recursive Annotation Variation, which are chunks of moves and comments, enclosed by parentheses (the word "Recursive" here means that they can be deeply nested and very often they are); - Numeric Annotation Glyphs (a '$' sign followed by a numerical code); - And finally, they can have errors, such as plain text without braces or unbalanced braces or parentheses. The desired outcomeThe purpose of a PGN parser is to take into consideration the above rules and store the relevant information into a suitable data structure, for future consumption by a chess engine.The parser I have created, given the above game, produces the following code (slightly adapted from Data::Dumper output): Notice that the moves are stored into an array, and the comments recorded with reference to the move where they belonged ('3b' = move 3, black; '6w' = move 6, white, and so on) so that the game, once stored, can be easily recreated. Also the parser takes care of errors (unbalanced brace and plaintext are stored with move number references). Parser architecture - first attemptA simple attack based on RegExes seemed doomed to failure, due to two problems, i.e. nested parentheses and the need of assigning the comments to the appropriate move number.Therefore I built a hybrid parser, scanning the text char by char and testing the apropriate RegEx for each case. Something along the lines of this (simplified) code: Here $REnumber and $REmove are regular expressions describin how a number and a chess move should be (see below), while _find_end_symbol(), _find_EOL() and _find_chunk() are functions that scan the char array for the desired portion of text, and return the text or a failure code (I didn't include here for simplicity.) It works. It has the two attributes that an algorithm should have to be reliable, i.e. (1) I know what it does and (2) it does what I want. Philosophical doubtsI was satisfied with my work. I had achieved something that was not trivial, with a cunning solution. I rewarded myself with some entertaining reading, and I was enjoying Oscar Wilde aphorisms, when I came across this one: I have the simplest tastes. I am always satisfied with the best. This observation triggered some doubts.Thinking about the little mechanism that I had built, it looked suspiciously like the internal working of a Regular Expression engine. Search for the beginning of a known RegExp and then see if the rest of the expression matches. I had the distinct feeling that I was reinventing a well known wheel, even though I didn't see which one. The PerlMonks effectI decided to write down the details of my findings and ask for advice. Thus, I went through the recommended routine. I searched the Monastery archives, and I searched the FAQs. I found nothing. At least, nothing explicitly related to parsing PGN. But I knew that the monks would look at it from a more pragmatic view, so I tried my best to forget about what I had done so far and looked at the Perl RegEx resources.I had previously tried a combination of \G and /g without achieving very much. So I looked at the FAQs for these bits, just for the sake of being able to say "I looked at it. I haven't found anything, so please give me some help." And then I saw the light. In perlfaq6 there is a snippet, presented as it was an ancient carved stone with no relationship whatsoever vith real life. It was the key to the real solution. Previously, I had only tried something like And I could catch every piece of the text, but without being able to tell which one I got!. Now, with the redo trick, the parser could be rewritten this way (again, it's simplified for the purpose of this post): This second parser is much more "perlish" than the previous one. It's 240 lines shorter and slightly faster. It will produce the same data structure as the previous one, dealing with comments and errors in a very efficient way, at the price of using an external module (thanks to TheDamian for Regexp::Common). Lessons learned1. preparing a post for PerlMonks is an enlightening experience, which can lead to a solution even before posting (!!)2. My process of auto-RTFMing in this case was made possible due to my previous knowledge of the "\G /g" mechanism, which triggered for the research of a RegEx solution. I wonder what I would have done had such a notion not stick to the back of my brain when I was reading the Camel book. I think that I should keep this fact in mind when I advise somebody. A final pleaI know that there are some real RegEx wizards in the Monastery. So I join the meditation contents to a request for other Monks experience:Is there any way of avoiding the external module and catch an arbitrary number of nested parentheses with a "normal" Regex? I know that the Owl book says it can't be done, but I would like to put my mind at rest on this issue. Thanks for your attention and for any piece of advice. update (31st Jan 2002 I found the solution! Thanks to Juerd for showing me the target, to TheDamian for providing the weapon and to blakem for adjusting my aim. Thus armed, as somebody would point out, I can shoot myself in the foot. :) The parser with this adjustment is twice as fast as the version using the module. Considering that it has to read 1.3 million records, it is a substantial improvement. update (09th Feb 2002 The parser is now in the CPAN _ _ _ _ (_|| | |(_|>< _|
Back to
Meditations
|
|