Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"

A demanding parser

by gmax (Abbot)
on Jan 25, 2002 at 14:58 UTC ( #141468=perlmeditation: print w/replies, xml ) Need Help??

For a few weeks, I have been working at a parser for PGN files (Don't get me wrong. I wasn't working on it 8 to 5, since I have to dedicate most of my time to my job. Just giving some thoughts to it from time to time). PGN files contain chess games, and the acronym stands for Portable Game Notation. There is a detailed Standard governing the rules to create such documents, therefore making life uneasy for the poor guy who wants to write a parser.
I started doing this because I need a parser to enter formalized PGN data into a database. Inserting the data is straightforward (well, almost, but there are problems that are related to database theory. Beyond that point, it's child's play for DBI). What is demanding is parsing the file.
There is a Chess::Pgn module in the CPAN, but it does not do anything but a basic reading, and in addition it is buggy (it misses the last game in each PGN file). I contacted the author some weeks ago, but no significant news has come up.
Then, I was on my own. I have discussed the matter at length with hsmyers, and after some prototypes I completed a parser that meets all the requirements.
I think it's time to explain the problem.

Structure of a PGN file

The following text is a PGN game with almost all the problems my parser can even meet.
[Event "Botvinnik Memorial"] [Site "Moscow"] [Date "2001.12.05"] [Round "4"] [White "Kasparov, Garry"] [Black "Kramnik, Vladimir"] [Result "1/2-1/2"] [ECO "C80"] [WhiteElo "2839"] [BlackElo "2808"] [PlyCount "37"] [EventDate "2001.12.01"] 1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 $1 {first comment} 4. Ba4 Nf6 5. O-O Nxe4 {second comment} 6. d4 ; comment starting with ";" up to EOL b5 7. Bb3 d5 8. dxe5 Be6 9. Be3 {third comment} 9... Bc5 10. Qd3 O-O 11. Nc3 Nb4 (11... Bxe3 12. Qxe3 Nxc3 13. Qxc3 Qd7 14. Rad1 Nd8 $1 15. Nd4 c6 $14 (15... Nb7 16. Qc6 $1 $16)) 12. Qe2 Nxc3 13. bxc3 Bxe3 % escaped line - it will be discarded up to the EOL 14. Qxe3 Nc6 {wrong } comment} 15. a4 Na5 oh? 16. axb5 {yet another comment} (16. Nd4 {nested comment}) 16... axb5 17. Nd4 (17. Qc5 c6 18. + Nd4 Ra6 19. f4 g6 20. Ra3 Qd7 21. Rfa1 Rfa8) 17... Qe8 18. f4 c5 19. Nxe6 the end 1/2-1/2
There are two parts. In the first one, there are very simple "tags", with a key and a value. Piece of cake. The second part is the game, which you can take as a whole and consider your job done, as the CPAN module does, or you can read the Standard and extract the information that I need to feed a database with.
Chess games in a PGN are made of, among other things:
- move numbers (digits plus dots : 1. 2. 3.);
- the moves themselves (e4 e5 Nf3 Nc6 and so on);
- braced comments {like this};
- comments to the end of the line, starting with a semicolon;
- escaped lines, starting with a '%';
- Recursive Annotation Variation, which are chunks of moves and comments, enclosed by parentheses (the word "Recursive" here means that they can be deeply nested and very often they are);
- Numeric Annotation Glyphs (a '$' sign followed by a numerical code);
- And finally, they can have errors, such as plain text without braces or unbalanced braces or parentheses.

The desired outcome

The purpose of a PGN parser is to take into consideration the above rules and store the relevant information into a suitable data structure, for future consumption by a chess engine.
The parser I have created, given the above game, produces the following code (slightly adapted from Data::Dumper output):
$game = { 'Event' => 'Botvinnik Memorial', 'Site' => 'Moscow', 'Round' => '4', 'White' => 'Kasparov, Garry', 'Black' => 'Kramnik, Vladimir', 'Date' => '2001.12.05', 'Result' => '1/2-1/2', 'WhiteElo' => '2839', 'BlackElo' => '2808', 'ECO' => 'C80', 'PlyCount' => '37', 'EventDate' => '2001.12.01', GameMoves => [ qw( e4 e5 Nf3 Nc6 Bb5 a6 Ba4 Nf6 O-O Nxe4 d4 b5 Bb3 d +5 dxe5 Be6 Be3 Bc5 Qd3 O-O Nc3 Nb4 Qe2 Nxc3 bxc3 Bxe3 Qxe3 Nc6 a4 Na5 axb5 axb5 Nd4 Qe8 f4 c5 Nxe6) ], 'GameComments' => { '3b' => ' $1 {first comment}', '5b' => ' {second comment}', '6w' => ' ; comment starting with ";" up to EOL ', '9w' => ' {third comment}', '11b' => ' (11... Bxe3 12. Qxe3 Nxc3 13. Qxc3 Qd7 14. Rad1 Nd8 $1 + 15. Nd4 c6 $14 (15... Nb7 16. Qc6 $1 $16))', '14b' => ' {wrong }', '16w' => ' {yet another comment} (16. Nd4 {nested comment})', '17w' => ' (17. Qc5 c6 18. Nd4 Ra6 19. f4 g6 20. Ra3 Qd7 21. Rfa1 Rfa8)' }, 'GameErrors' => { '14b' => 'comment}', '15b' => 'oh?', '19w' => 'the end' } }
Notice that the moves are stored into an array, and the comments recorded with reference to the move where they belonged ('3b' = move 3, black; '6w' = move 6, white, and so on) so that the game, once stored, can be easily recreated.
Also the parser takes care of errors (unbalanced brace and plaintext are stored with move number references).

Parser architecture - first attempt

A simple attack based on RegExes seemed doomed to failure, due to two problems, i.e. nested parentheses and the need of assigning the comments to the appropriate move number.
Therefore I built a hybrid parser, scanning the text char by char and testing the apropriate RegEx for each case. Something along the lines of this (simplified) code:
my %switchcolor = ( 'w' => 'b', 'b' => 'w'); my @gamechars = split //, $gametext; for ($position =0; $position < scalar @gamechars; ) { if ($gamechars[$position] =~ /([\(\[\{])/) { $end_comment = _find_end_symbol(\@gamechars, $position, $1) +; $comment = substr($gametext, $position, $end_comment - $position +1); $game{comments}->{$movecount.$color} .= " " . $comment; $position = $end_comment + 1; } elsif ($gamechars[$position] =~ /([;%])/) { $end_comment = _find_EOL(\@gamechars, $position); # store comment as above } elsif (($gamechars[$position] =~ /[1-9]/) { $num = _find_chunk (\@gamechars, $position, $REnumber); $movecount++; $position += length($num); } elsif ($gamechars[$position] =~ /([OKQRBNa-h])/) { $move = _find_chunk(\@gamechars, $position, $REmove); $color = $switchcolor{$color}; $position += length($move); } else { $position++; # store into errors unless /\n|\s/; } } sub _find_chunck (\@array $pos $regexp ){ # returns a portion of @array starting from $pos # matching $regexp } sub _find_EOL (\@array $pos){ # returns the nearest EOL from @array starting from $pos # } sub _find_end_symbol (\@array $pos $symbol ){ # returns the position of the closing symbol # "{" => "}", "(" =>")", "[" => "]" # in \@array, starting at $pos }
Here $REnumber and $REmove are regular expressions describin how a number and a chess move should be (see below), while _find_end_symbol(), _find_EOL() and _find_chunk() are functions that scan the char array for the desired portion of text, and return the text or a failure code (I didn't include here for simplicity.)
It works. It has the two attributes that an algorithm should have to be reliable, i.e. (1) I know what it does and (2) it does what I want.

Philosophical doubts

I was satisfied with my work. I had achieved something that was not trivial, with a cunning solution. I rewarded myself with some entertaining reading, and I was enjoying Oscar Wilde aphorisms, when I came across this one: I have the simplest tastes. I am always satisfied with the best. This observation triggered some doubts.
Thinking about the little mechanism that I had built, it looked suspiciously like the internal working of a Regular Expression engine. Search for the beginning of a known RegExp and then see if the rest of the expression matches.
I had the distinct feeling that I was reinventing a well known wheel, even though I didn't see which one.

The PerlMonks effect

I decided to write down the details of my findings and ask for advice. Thus, I went through the recommended routine. I searched the Monastery archives, and I searched the FAQs. I found nothing. At least, nothing explicitly related to parsing PGN. But I knew that the monks would look at it from a more pragmatic view, so I tried my best to forget about what I had done so far and looked at the Perl RegEx resources.
I had previously tried a combination of \G and /g without achieving very much. So I looked at the FAQs for these bits, just for the sake of being able to say "I looked at it. I haven't found anything, so please give me some help."
And then I saw the light. In perlfaq6 there is a snippet, presented as it was an ancient carved stone with no relationship whatsoever vith real life.
while (<>) { chomp; PARSER: { m/ \G( \d+\b )/gcx && do { print "number: $1\n"; redo; }; m/ \G( \w+ )/gcx && do { print "word: $1\n"; redo; }; m/ \G( \s+ )/gcx && do { print "space: $1\n"; redo; }; m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; }; } }
It was the key to the real solution. Previously, I had only tried something like
while (m/$REmove|$REnumber|$REcomment|$REeolcomment/g) { print $1; }
And I could catch every piece of the text, but without being able to tell which one I got!.
Now, with the redo trick, the parser could be rewritten this way (again, it's simplified for the purpose of this post):
package PGNParser; # temporary name use FileHandle; use Regexp::Common; sub new { my $class = shift; my $filename = shift || return undef; my $fh = new FileHandle "< $filename"; unless (defined $fh) { return undef } my $self = bless { GameMoves =>[], # game moves GameComments =>{}, # comments with reference to the move gamedescr => {}, # will contain the PGN tags GameErrors => {}, # will contain the parsing errors fh => \$fh # filehandle to the PGN file }, $class; return $self; } sub read_game{ # will read the game from a PGN file # after this, the game text will be in $self->{gamedescr}{Game} } my $REresult = qr{(?:1\-0|0\-1|1\/2\-1\/2|\*)}; my $REmove = qr{[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:\=[QRBN])?}; my $REcastling = qr/O\-O(?:\-O)?/; my $REcheck = qr/(?:(?:\#|\+(\+)?))?/; my $REanymove = qr/(?:$REmove|$REcastling)$REcheck/; my $RENAG = qr/\$\d+/; my $REnumber = qr/\d+\.(?:\.\.)?/; my $REescape = qr/^\%[^\n]*\n/; my $REeolcomment= qr/;.*$/; my $REcomment = $RE{balanced}{-parens=>'{}'}; my $RERAV = $RE{balanced}{-parens=>'()'}; my %switchcolor = ('w' => 'b', 'b' => 'w'); sub parse_game { my $self = shift; return undef unless $self->{gamedescr}{Game}; my $movecount = 0; my $color = 'b'; $self->{gamedescr}{Game} =~ s/$REresult\s*\Z//o; PARSER: { $self->{gamedescr}{Game} =~ m/\G($REnumber)\s*/mgc && do { my $num=$1; if (( $num =~ tr/\.//d) > 1) { $color = 'w'; } if ($movecount == 0) { $movecount = $num; } elsif ($movecount == ($num -1)) { $movecount++; } elsif ($movecount != $num) { $self->{GameErrors}->{$movecount.$color} .= " invalid move sequence ($num <=> $movecount)"; $movecount++; } redo }; $self->{gamedescr}{Game} =~ m/\G($REanymove)\s*/mgc && do { push @{$self->{GameMoves}}, $1; $color = $switchcolor{$color}; redo }; $self->{gamedescr}{Game} =~ m/\G($REcomment|$REeolcomment|$RERAV|$RENAG|$REescape)\s*/mg +c && do { $self->{GameComments}->{$movecount.$color} .= " " . $1; $self->{GameComments}->{$movecount.$color} =~ tr/\r//d; $self->{GameComments}->{$movecount.$color} =~ s/\n/ /g; redo }; $self->{gamedescr}{Game} =~ m/\G(\S+\s*)/mgc && do { $self->{GameErrors}->{$movecount.$color} .= " " . $1; $self->{GameErrors}->{$movecount.$color} =~ tr/\r//d; $self->{GameErrors}->{$movecount.$color} =~ s/\n/ /g; redo }; } }
This second parser is much more "perlish" than the previous one. It's 240 lines shorter and slightly faster. It will produce the same data structure as the previous one, dealing with comments and errors in a very efficient way, at the price of using an external module (thanks to TheDamian for Regexp::Common).

Lessons learned

1. preparing a post for PerlMonks is an enlightening experience, which can lead to a solution even before posting (!!)
2. My process of auto-RTFMing in this case was made possible due to my previous knowledge of the "\G /g" mechanism, which triggered for the research of a RegEx solution. I wonder what I would have done had such a notion not stick to the back of my brain when I was reading the Camel book. I think that I should keep this fact in mind when I advise somebody.

A final plea

I know that there are some real RegEx wizards in the Monastery. So I join the meditation contents to a request for other Monks experience:
Is there any way of avoiding the external module and catch an arbitrary number of nested parentheses with a "normal" Regex? I know that the Owl book says it can't be done, but I would like to put my mind at rest on this issue.
Thanks for your attention and for any piece of advice.

update (31st Jan 2002
I found the solution!
Thanks to Juerd for showing me the target, to TheDamian for providing the weapon and to blakem for adjusting my aim. Thus armed, as somebody would point out, I can shoot myself in the foot. :)

The parser with this adjustment is twice as fast as the version using the module. Considering that it has to read 1.3 million records, it is a substantial improvement.

update (09th Feb 2002
The parser is now in the CPAN
 _  _ _  _  
(_|| | |(_|><

Replies are listed 'Best First'.
Re: A demanding parser
by blakem (Monsignor) on Jan 25, 2002 at 15:14 UTC
    I hate to post a short response to such a well thought out and interesting post, but the best resource I've found on \G and /g is chapter 6 of japhy's new book.

    Although I think you have implemented the recommended "inchworm" approach, this chapter will help you understand exactly what the code is doing behind the scenes.


Re: A demanding parser
by Juerd (Abbot) on Jan 25, 2002 at 16:43 UTC

    Is there any way of avoiding the external module and catch an arbitrary number of nested parentheses with a "normal" Regex? I know that the Owl book says it can't be done, but I would like to put my mind at rest on this issue.

    It depends on what you call a "normal" regex. If normal means without any perl specific things, it can not be done. But if you don't mind a perl specific regex, perlre has the solution:

    The following pattern matches a parenthesized group:
    $re = qr{ \( (?: (?> [^()]+ ) # Non-parens without backtracking | (??{ $re }) # Group with matching parens )* \) }x;

    2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

      Is there any way of avoiding the external module and catch an arbitrary number of nested parentheses with a "normal" Regex?
      use Regexp::Common; $str =~ /$RE{balanced}{-parens=>'()'}/
        That's what he does already. He asked if it were possible without the module. (Not a very good idea, but I don't know the motivation (education, perhaps))

        2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

      Thanks for the tip. I am not sure I understand how to use it, though.
      My purpose, as you have pointed out, is to replace Regexp::Common with some normal Perl RegEx. By normal I mean a non-module dependant expression.
      As for the motivation, you guessed right that it's related to education. Personally, I wouldn't bother. I need to distribute this module as part of a more extensive educational material aiming at the build-up of a huge database. I would like to avoid pointing to a CPAN module, since many people in the audience are not experienced Perl users. They should just copy this module to their computers and execute the import/export script.
      Of course I can provide them with a copy of the module, or instruct them to connect to the CPAN, download the module and install it, or use "perl -MCPAN -e shell" but it would steal valuable time from my lectures.

      That aside, here is a test script for your RegEx, which does not seem to give me what I want.
      Was it my misunderstanding, or were you trying to show me how to catch the inner parenthesized text only?
      #!/usr/bin/perl -w use strict; use Regexp::Common; my $re = qr{ \( (?: (?> [^()]+ ) | (??{ $re }))* \) }x; my $input = "aa bb cc (dd ee (ff gg (hh) jj) kk)"; print "With module\n"; while ($input =~ m/(\w+|$RE{balanced}{-parens=>'()'})\s*/g) { print "$1\n"; } print "With recursive RegExp\n"; while ($input =~ m/(\w+|$re)\s*/g) { print "$1\n"; } __END__ # output: With module aa bb cc (dd ee (ff gg (hh) jj) kk) With recursive RegExp aa bb cc dd ee ff gg (hh) jj kk
      Found the problem. Recursive RegExes don't work properly with use strict
      my $re = qr{ \( (?: (?> [^()]+ ) | (??{ $re }))* \) }x;
      no strict 'vars'; $rec_re = qr{ \( (?: (?> [^()]+ ) | (??{ $rec_re }))* \) }x; my $re = $rec_re; use strict;
      makes the same output from both regexes.
       _  _ _  _  
      (_|| | |(_|><
      Wait a second - is that really a recursive regex there? :-o

      Makeshifts last the longest.


        2;0 juerd@ouranos:~$ perl -e'undef christmas' Segmentation fault 2;139 juerd@ouranos:~$

(tye)Re: A demanding parser
by tye (Sage) on Jan 26, 2002 at 00:11 UTC

    I'd certainly drop the && do { ... redo } hack which I find to be not even close to worth the "surprise" factor (plus the maintainance problems of not noticing when you forget a "redo"). And I'd add a different hack (single-argument for) which I think offers a big win in this specific case:

    for( $self->{gamedescr}{Game} ) { while( ! m/\G\z/mgc ) { if( m/\G($REnumber)\s*/mgc ) { my $num=$1; #... } elsif( m/\G($REanymove)\s*/mgc ) { push @{$self->{GameMoves}}, $1; $color = $switchcolor{$color}; } elsif( m/\G($REcomment|$REeolcomment|$RERAV|$RENAG|$REe +scape)\s*/mgc ) { #... } else { die "Invalid input..."; } } }

            - tye (but my friends call me "Tye")

Log In?

What's my password?
Create A New User
Node Status?
node history
Node Type: perlmeditation [id://141468]
Approved by root
and all is quiet...

How do I use this? | Other CB clients
Other Users?
Others romping around the Monastery: (6)
As of 2017-03-30 01:56 GMT
Find Nodes?
    Voting Booth?
    Should Pluto Get Its Planethood Back?

    Results (353 votes). Check out past polls.