Not strictly a Perl question, but I'd still like some help. How does one go about finding patterns? This question came up when I wanted to parse chess games. Conclusions derived from parsing chess games could be useful for understanding a player's style, or the game itself.

Thanks,
Artist.

Re: Finding Patterns
by hv (Parson) on Sep 22, 2004 at 22:56 UTC

I think there are two very different things you *might* be talking about here: "given a pattern, and some data, how do we find that pattern in the data", or "given some data, how do we find interesting patterns in it".

The first is ludicrously easy compared to the second, but has two areas of difficulty. One is how we represent the pattern - the language of regular expressions is our canonical example of "a way to represent patterns" over character streams. Different domains tend to give different languages, whether it is the type of data that is different (i.e. not a character stream) or the type of pattern; indeed, you could argue that the symbology of mathematics is the canonical "pattern language" for numbers (or even go further and argue that mathematics itself is "the study of patterns").

The second difficulty with the first question is finding matches for the pattern in a particular dataset. What is difficult here is entirely dependent on the type of pattern you are looking for: to some extent the language of regular expressions only bothers to provide mechanisms for expressing patterns we actually know how to check for. If you want to know more about how to match regular expressions, you might find it easier to look at the source code for PCRE than at perl's own regular expression engine.

The second question, "how do we find patterns without specifying in advance what patterns to look for" is very different, and as far as I know we don't have any real answer for that at the moment other than to point a human at the data.

I can identify two specific difficulties here, though I'm sure there are more: how do we decide (or specify) what patterns would qualify as "interesting", and how do we avoid restricting what types of pattern we are capable of discovering.

To me, these skills - spotting a new pattern, and taking advantage of it - are the essence of intelligence. It is far more efficient to remember one rule that applies to everything than to remember, for each individual thing, that the rule applies to it; looked at that way, "finding a pattern in the data" is very similar to "finding a way to compress the data". However, not all "ways to compress the data" necessarily represent "interesting patterns", and that brings us back to the nub. In particular, neural nets aren't very helpful in this respect: you may well be able to design a neural net that, once trained, is very good at compressing data, but it won't be able to tell you what it's doing, nor will you be able to determine that from analysis of the system, in any way that has a chance of being expressed as an "interesting pattern".

Some of these areas are touched on in Fluid Concepts and Creative Analogies, but I need to track down some of the original papers discussed there to find some of the more interesting bits. In particular, I read there that for 'Seek Whence', Marsha Meredith developed a language for expressing the patterns of a particular class of integer sequences in such a way as to preserve the types of things that humans would choose to express; for example a sequence like:

```
  1 1 1 2 1 2 2 3 1 2 3 3 4 1 2 3 4 4 5 ...
```

would tend to be viewed by a human as "a series of ladders of increasing height from 1 to n, with the penultimate entry duplicated"; most formal pattern languages would be capable of expressing this pattern, but would probably not express it like that - they'd probably split each ladder into two ranges, and might have to add a special case at the start to cope with the "degenerate" first ladder, which has no "penultimate entry".
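As a quick check that the human-style description really does generate the sequence, here is a sketch (Python, my own illustration) implementing it directly:

```python
def ladder_sequence(n_ladders):
    """Concatenate ladders 1..n, each with its penultimate entry
    duplicated; the first ladder [1] is degenerate and unchanged."""
    out = []
    for n in range(1, n_ladders + 1):
        ladder = list(range(1, n + 1))
        if n >= 2:
            ladder.insert(-1, n - 1)  # duplicate the penultimate entry
        out.extend(ladder)
    return out
```

The first five ladders reproduce the sequence shown above exactly, which is some evidence that the "human" description is a complete pattern, not just an impression.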

Finding better ways to express patterns is definitely a step in the right direction, and I suspect that it is only when we've gone further in extending and unifying "how to express a pattern" that we'll be able to improve our ability to reason about patterns themselves, and maybe that is the route that will let us work out how to do the rest.

However, within that book (which I've not finished), I've not yet seen any attempt to tackle the second problem: how to avoid restricting the type of pattern we can find. I don't know how to get there from here - we somehow need to be able to invent a new concept, determine whether it is a useful one, and if so add it to our dictionary/network of concepts to take advantage of in the future. Humans manage to do this remarkably well, bootstrapping from (essentially) nothing at all.

Hugo

Re: Finding Patterns
by BrowserUk (Pope) on Sep 23, 2004 at 01:32 UTC

The problem with your question as asked is that there is no way to instruct a computer to "search data and find information". You have to instruct it what the data looks like, and what would constitute a pattern.

For example: You could inspect a file containing text and conclude: It contains a series of numbers between 1 and 255. Or between 0 and 65535. Or 0 and ...; or it contains letters and numbers and spaces; or words; or sentences.

The point is that at every level, you the programmer have already applied some knowledge to the algorithm. But even then, all you have derived from your data is information - the number of characters, their frequencies, etc. It still requires you, the user of the program, to apply further heuristics to achieve your goal and derive knowledge.

About the best you can do when it comes to deriving (unforeseen) knowledge from data is to make some guesses about what might be hidden in it, and then use a program to process and display the data in such a way that the hidden knowledge becomes easily apparent.

An example of this is testing the output of pseudorandom number generators for patterns and correlations. One such method is spectral testing (see Re: Testing for randomness (spectral testing) for a crude example). This technique can be applied to many types of data.

In your example of analysing a player's chess strategy, you had a tentative idea that it might be possible to detect whether a player tended to castle as early as possible in a game. You might test this theory by plotting his games on a spectral grid and looking for "clumping". To do this, you would need to derive 2 numerical values (or 3, but let's start simple :).

For each game, you derive your set of pairs of numbers and plot them onto the grid, using one as the X-coordinate and the other as the Y. You then increase the numeric value of the colour plotted at the corresponding point. As you plot more and more games, the intensity of the colour at various points on the plot will increase. If the whole plot tends to increase evenly, you have essentially shown that there is no correlation between the two variables you have plotted. If the colours tend to clump at various points, then you may have discovered some level of correlation. It's then up to you to decide what, if any, significance that correlation has.
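A counts-based sketch of that plotting scheme (Python, names like `spectral_grid` my own; incrementing a cell stands in for brightening a pixel):

```python
def spectral_grid(pairs, size=8):
    """Accumulate (x, y) pairs into a size x size grid of counts.
    "Plotting" a game becomes incrementing the cell its pair lands in."""
    grid = [[0] * size for _ in range(size)]
    for x, y in pairs:
        grid[y % size][x % size] += 1
    return grid

def clumpiness(grid):
    """Crude clumping measure: ratio of the hottest cell to the mean.
    Near 1.0 means an even plot (no correlation); large values mean
    the counts are clumping at particular points."""
    cells = [c for row in grid for c in row]
    mean = sum(cells) / len(cells)
    return max(cells) / mean if mean else 0.0
```

Feeding in pairs spread evenly over the grid gives a ratio of 1.0; pairs that pile up on one cell give a much larger ratio, which is the "clumping" to investigate further.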

You might, for instance, plot the move number along one axis and the type of move made along the other. The problem then becomes how to categorise the moves (into an integer). You might choose to categorise castling as 1; a non-attacking move (a move that does not place an opponent's piece under attack) as 2; an attacking move as 3; a trade of pieces accepted as 4; a trade of pieces declined as 5; and so on. This might lead to the conclusion that the player prefers to start out with a land grab before going on the attack; or that they like to trade pieces whenever possible; etc.
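A heavily simplified sketch of such a categoriser (Python, my own illustration). Note the simplification: telling attacking from non-attacking moves, or accepted from declined trades, really needs full board state, so here a capture stands in for "attacking":

```python
def categorise_move(san):
    """Map a SAN move string to an integer category, roughly following
    the scheme above: 1 = castling, 3 = capture (proxy for "attacking"),
    2 = any other (quiet) move. Categories 4 and 5 (trades accepted or
    declined) are omitted because they require board state to detect."""
    if san.startswith('O-O'):
        return 1
    if 'x' in san:
        return 3
    return 2
```

Pairing `categorise_move(move)` with the move number gives exactly the (X, Y) values to feed onto the spectral grid.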

So, there are three parts to the problem:

1. Deciding what variables to plot.
2. Deciding how to map those variables to integers.
3. Deciding what, if any, significance any clumping that is detected has.

The last part is by far the hardest. For example, you can analyse a writer's output by plotting word pairs and word triples. With enough input, the spectral patterns can become indicative of authorship; but in and of themselves, such plots tell you very little about the author.

There is still too much background noise in the data to be able to tell anything about their education or origins, etc. For example, there are certain pairs and triples of words that crop up with high frequency regardless of the writer. To remove this noise from the plot, it becomes necessary to "normalise" it. One way of doing this is to start out with a plot of a mid-range colour (for 8-bit colour, you start by setting every pixel to 128). You then process a whole range of inputs from many authors and types of text, but instead of incrementing each pixel, you decrement it. This becomes your normalised background.

You can then use this on a batch of samples of your author's work, this time incrementing instead of decrementing. Where common pairs and triples occur in the author's work, they will tend to return the appropriate pixels to the mid-range colour. Where he uses unusual pairings of words, they will tend to lift the corresponding pixels above the mid-range. By applying a high-pass filter to the results, you will tend to isolate the author's particularities from the background. You may then have the basis for further linguistic analysis.
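A counts-based analogue of that normalise-then-increment scheme (Python sketch, function names my own; word pairs only, with a zero threshold playing the role of the high-pass filter):

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in a text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

def distinctive_bigrams(author_texts, background_texts):
    """Background samples decrement each pair's score; the author's
    samples increment it. Pairs left above zero are the ones the
    background did not 'explain away' - the author's particularities."""
    score = Counter()
    for t in background_texts:
        score.subtract(bigram_counts(t))
    for t in author_texts:
        score.update(bigram_counts(t))
    return {pair: s for pair, s in score.items() if s > 0}
```

Common pairs that appear in both the author and the background cancel back toward zero, just as common word pairs return pixels to the mid-range colour; only the unusual pairings survive the filter.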

With the right equipment and/or software, this type of analysis can 'discover' correlations that would otherwise be very hard to detect. The nice thing about it is that it can be relatively trivial (and fast) to perform an analysis of many different pairings of variables and just see what shows up. The downside is that once you think you have detected a correlation, it falls to human ingenuity to apply some reasoning to it :)

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re: Finding Patterns
by Tii (Monk) on Sep 22, 2004 at 18:20 UTC

This could become a very interesting thread, especially to me since I just finished reading a long AI discussion here on the site.

However, could you be more detailed? IMHO, your question seems very broad in scope and could use some focusing.

Here are some questions that came to mind: You mention parsing chess games. In what format is the data that you would like to parse? Are you considering writing a program that learns how to play chess?

Tii
We always learn something from our environment and find patterns. Given the numbers '2 3 5 7 11 13 17 19 23 ...' we immediately recognize them as prime numbers. It is easy for a computer to recognize patterns in numbers. People usually make decisions based on observed patterns; some creative people go on to create new patterns in life, etc.
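This kind of recognition only works against categories the recogniser already knows, a point made below: anything else gets filed as "unknown". A sketch (Python, my own illustration, with a deliberately tiny library of known categories):

```python
def is_prime(n):
    """Trial division up to sqrt(n) - fine for small numbers."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def matches_known_pattern(seq):
    """Check a sequence against a (tiny) library of known categories.
    Anything the library doesn't cover is filed as 'unknown'."""
    if all(is_prime(n) for n in seq):
        return "prime numbers"
    if all(b - a == seq[1] - seq[0] for a, b in zip(seq, seq[1:])):
        return "arithmetic progression"
    return "unknown"
```

The sequence '2 3 5 7 11 13 17 19 23' is recognized only because a primality test was written in advance; remove that branch and the same data becomes "unknown".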

Chess is a game where we can deduce a number of patterns. Sometimes we may not know what pattern we are looking for. For example: (1) we may find out from analyzing games that player 'A' always castles as early as possible, while player 'B' waits until he realizes his king is in danger; (2) we can find the typical 'forks' that a chess player uses.

These can reveal the "chess player's chess-nature" in descriptive terms, just as DNA can reveal the blueprint of our life. I may be able to parse chess games from a number of different formats, e.g. PGN, using Chess::Pgn. Ultimately, I'd like to write a good chess engine.
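Chess::Pgn handles the parsing in Perl; as a language-agnostic illustration of what that first step involves, here is a crude sketch (Python, my own, and deliberately *not* a full PGN parser - it ignores variations and NAGs) that pulls the SAN moves out of a simple PGN game:

```python
import re

def pgn_moves(pgn_text):
    """Crude SAN move extraction from a single simple PGN game."""
    body = re.sub(r'\[[^\]]*\]\s*', '', pgn_text)        # drop tag pairs
    body = re.sub(r'\{[^}]*\}', '', body)                # drop comments
    body = re.sub(r'(1-0|0-1|1/2-1/2|\*)\s*$', '',       # drop the result
                  body.strip())
    # What remains is move numbers and moves; discard the numbers.
    return [t for t in body.split() if not re.fullmatch(r'\d+\.+', t)]
```

The resulting move list is exactly the input the pattern-hunting ideas elsewhere in this thread (castling timing, move categories, piece usage) need.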

One of the problems of looking for patterns is matching known categories to the input. In your example, if prime numbers were not understood by the searcher, the pattern might be filed as "unknown".

On the other hand, if the searcher had a broad set of analysis tools [or access to Sloane's database], it might be able to build the category of prime numbers based on the given input (no divisors between 1 and N).

But the searcher must be careful to employ some simplification schemes (perhaps an Occam's Razor module?), because there are numerous formulas that will have those numbers on the curve, or as axis crossing points, etc.

I suspect that a good searcher will have to have or evolve infrastructure for category description and matching, as well as an engine for developing new categories and testing hypotheses.

Finally, I wasn't clear whether you meant "find the pattern in..." (such as the primes example), or "generalizations in time and/or space" (which is nearer the chess analogy). One involves induction, the other deduction [if I've got my head on straight today].

-QM
--
Quantum Mechanics: The dreams stuff is made of

The issue is that when you talk chess, you talk _very_ large data-sets. You can keep track of games (both board state and players' moves) and build out your statistical pool with every game that is played. Given a very large number of games with a player, you may pick up statistical anomalies vs. a large set of other players' games. But there are so many combinations of moves and board states that you would be hard pressed to find a large enough population of data on one player to build even an "ok" data-set. And even then, it is really only useful if you have that same volume from a ton of different people.

To bring this back to your example: where we may near-instantly recognize the small set of primes you listed above, can the same really be said of 6 random board states from historical Karpov games? Yes, your mind is trained to grasp patterns such as the one above. Yes, your computer can run tests on the numbers above and tell you that they are prime - but only if tests are written to do so. The problems with applying those principles directly to chess are great. Chess is more complex. Chess is played by humans (the first time I play you is not the same as the second time; I learn, think and evolve), which makes the data-sets less than static. At most you would be able to derive fingerprints for players with large enough statistical anomalies - but then again, if they continue to play the same way, their opponents will eventually learn and overtake them.

-Waswas
Re: Finding Patterns
by Zed_Lopez (Chaplain) on Sep 23, 2004 at 16:55 UTC
Re: Finding Patterns
by TrekNoid (Pilgrim) on Sep 23, 2004 at 14:55 UTC
If you have large datasets, then I think you might want to do some looking into 'Data Mining'.

Kurt Thearling's Page has some good tutorials and white papers on the subject of data mining, which he defines as "the automated extraction of hidden predictive information from databases".

If your datasets aren't large, however, then I don't know that this will be much help.

Trek

Re: Finding Patterns - a List of AI/NLP nodes
by erix (Prior) on Nov 14, 2004 at 15:02 UTC

There is also a node-list on bioinformatics, but that may be less interesting for chess parsing.

Btw, I invite everyone to send me more node_id's for these subjects. Just /msg the node_id to erix, I'll check for doubles myself.

Re: Finding Patterns
by dimar (Curate) on Nov 14, 2004 at 17:45 UTC

This is one of those kinds of questions you could answer with either an entire bound volume (scholarly method); a series of pre-fab perl scripts (utilitarian method); or with the nuanced flick of a wrist (zen method).

My method is far more prosaic: answer with the first three things that come to mind off the top of my head.

• Chapter X "Levels of Description and Computer Systems" (see especially 'Chunking and Chess Skill'). Godel Escher Bach
• language specific vs. author specific traits in a body of text sample link
• the shape of a song (warning: uses JAVA applets!)

The second and third items represent my own speculation that you could: 1) 'fingerprint' a particular chess-player's style of play simply by a textual analysis of a notated game; and 2) visually represent (and thereby distinguish) styles of play in novel ways...

Fun stuff...

... yes yes ... the correct spelling is "Gödel" ... please take this up with isbn.nu, not me, thanks.

Re: Finding Patterns
by TedPride (Priest) on Sep 23, 2004 at 03:39 UTC
Some patterns to look for in chess might be:

a) What percentage of moves a person tends to make with each type of piece;
b) Specific formations a person commonly uses (ignoring all but the x furthest-forward pieces);
c) A person's most-preferred starting sequence of x moves.
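Item (a) is straightforward once the moves are in SAN; a sketch (Python, my own illustration, counting castling as a king move since SAN gives it no piece letter):

```python
from collections import Counter

def piece_usage(san_moves):
    """Percentage of moves made with each piece type ('P' = pawn),
    computed from a list of SAN move strings."""
    counts = Counter()
    for m in san_moves:
        if m.startswith('O-O'):
            piece = 'K'            # castling is a king move
        elif m[0] in 'KQRNB':
            piece = m[0]           # SAN leads with the piece letter
        else:
            piece = 'P'            # pawn moves have no piece letter
        counts[piece] += 1
    total = sum(counts.values())
    return {p: 100.0 * c / total for p, c in counts.items()}
```

Run over a large sample of one player's games, an unusually high 'N' percentage is exactly the "likes to use knights" signal worth disrupting.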

All of these will help you in disrupting the lines of play that a person prefers to use. Does he like to use knights? Trade for his knights. Does he like a certain starting play? Make sure to do something different - or spend your cycles doing a few extra iterations on the lines of play he's most likely to do.

The problem will of course be speedily rendering the board into meaningful numbers that can be quickly matched to a database of past behavior. This could take years to do properly, and I for one wouldn't have the patience for it :)