|Perl: the Markov chain saw|
Choosing a data structure for AI applicationsby Ovid (Cardinal)
|on Jul 16, 2002 at 01:46 UTC||Need Help??|
Ovid has asked for the
wisdom of the Perl Monks concerning the following question:
Why have I posted this?
As I mentioned in the thread AI in Perl - proof of concept, I would like to create a Prolog-like Perl module (perlog?). This would be useful for people who would like to learn about AI but leverage their Perl knowledge. I want to design the syntax to be similar to Prolog because Prolog is fairly easy to learn and tends to naturally model the underlying problem.
This is actually a pretty long post and covers some fairly tricky things so it might be more appropriate to have some sort of collaborative effort started to work on this (this is open source, after all :). However, past collaborative efforts that were started here seem to have fizzled, yet many people enjoying responding to isolated posts, so I'm considering making this a "collaborative" effort in the sense that I make weird, long-winded posts and monks try to help me out -- nothing new there, huh?. If you like this idea, keep reading. If not, move along folks; nothing to see here.
Before reading the rest of this post, it might be helpful for you to read a Prolog tutorial, though this is by no means necessary.
For those who may be concerned about duplication of effort, I have already checked out AI::Proplog (data structures are too simple), Language::Prolog::Interpreter (same problem), and Prolog::Alpha. The last module looks very interesting, but some of my objections about Prolog apply. There are other issues that I have with the last one, but they're beyond the scope of this post. I may return to the latter module and look at it more closely.
The basics of what needs to be implemented
Let's say I have a set of facts about the things that Ovid gives to others. These facts may resemble the following:
Note that everything starts with a lower-case letter. In Prolog, if an argument starts with an upper-case letter, it's a variable. This artificial constraint is part of the reason why I don't want to model Prolog too closely.
With the above set of facts (called a 'database', in Prolog), I can then ask questions. Let's say that I want to know if Ovid gives a book to kudra. I can ask this:
Prolog would respond "yes". However, if I ask the following:
Prolog would respond "no". This is another reason why I would prefer to make something similar to Prolog, but still being Perl. A Perl programmer would expect a true or false response, but 'no" evaluates as true in Perl.
Now, let's say that I want to know to whom I would give books. I could ask this (note that the word "Who" is capitalized, making it a variable):
If you know Prolog, you realize that I skipped some stuff there. Deal with it :)
With the data structure that I used in the Proof of Concept post, asking questions like this (particularly with facts that reference other facts) results in my using a lot of map and grep statements. With only a few facts, this is not an issue. However, true AI systems can have huge databases and this solution is not scalable. While I realize that some might scream "premature optimization", I submit that I have a known systemic performance issue and since I will be eventually auto-generating much of the code based upon the underlying data structures (trust me for now, it's mandatory), I need to work hard up front on this one piece of optimization. I have other ideas for improving performance, but I don't intend to implement them as they can be put in place after I know I have working code.
One idea that I've considered is simply using HoH and not using the values:
With that, to find out who Ovid gives books to, the code is simplified - ignoring for the moment that we need to check for the existence of allhash keys to avoid auto-vivifying data.
Now, that's not much of a win, but it is when we're looking up a single key. If we want to know if Ovid gives books to merlyn:
With my previous data structure, I would have had to iterate over all of the recipients. Clearly this is not scalable. If we have a database of tens of thousands of facts (large AI systems often have millions of facts) then the ability to quickly isolate a single fact is crucial.
This still doesn't quite work, though. For example, what if I want to have a list of everyone who gives books to others, regardless of the recipient? In Prolog I would ask this:
Note: The final underscore says "I don't care who gets the book". Also, associating the results with a variable ("Who") is known as "Unification").
In Perl, I'm back to an iterative solution.
Hmm... this is still better, but not good enough. Unification is one of the thorniest problems with trying to implement Prolog in Perl. It's not difficult to do. It's difficult to do quickly.
This solution also has the side effect of "randomly" ordering the facts in the database. I'm not yet sure if this is an issue. Some of the designs I have considered involve binary searches or weighted searches, so clearly a hash structure would have to go.
So, it appears that I haven't solved my problem, but it gets worse. What if I want to know what the title of the book is? In Prolog, I might embed another fact in the first fact.
Now that I've given merlyn a book, I can find out what book I gave him.
Essentially, I can embed facts in other facts. This, to me, suggests "references". What if kudra, anxious about merlyn's faltering ability with Perl, also wants to loan him the llama?
As far as Prolog is concerned, both kudra and Ovid have given merlyn the same book, so I can store a reference to this book in one spot and have both facts refer to it (if I want to ensure that Prolog knows these are distinct books, I have to add more information to the book fact).
If you're beginning to suspect that there are some Lispish aspects of all of this, let me further confirm your suspicions. We can also manipulate lists in Prolog. What if I want to specify that three different items are in a kitchen drawer?
Without going into all of the semantics of lists in Prolog (confession: I'm relearning some of this as I go -- not a good thing to be doing when I'm trying to reimplement this in Perl :), you can see how that also suggests a reference. And, naturally, the elements of a list can be constants, variables, facts, or other lists.
What data structure(s) do I use?
Note: I have no illusions that I can design a system suitable for large AI applications. Perl is simply not going to be fast enough to handle databases of millions of facts with thousands of rules, particularly when backtracking and recursion cut in. But I think that this might be practical as a learning tool.
As for data structures, I would kill to have Perl6 pairs, as they would solve my problems. Since Perl6 is not an option at this time, I would be willing to consider some sort of Inline::C solution. I haven't tested it, but if I could make a hash key bi-modal so that it represents both a string and an integer, I could use the integer as an array index to pick up additional properties about a given key.
Unfortunately, this has problems with mathematical functions.
Another possibility would be to use two data structures. One would contain the original facts and the second would contain meta-data about those facts. The issue with this solution is figuring out how to relate the two data structures. Perhaps I could embed a "key" in the facts has and parse out the key every time I grab data, but that seems like an ugly solution.
I know that other solutions are available. My main concern is that these other solutions be fast and relatively straight-forward. By straight-forward, I mean they can be complex, but if they have a bunch of special cases, they will much more likely to be buggy. If I have to test if something is a scalar, a hash ref, an array ref, a sub ref, etc, then it's probably not the best solution. On the other hand, if I have to gather data from more than one source, but the data acquisition remains fairly constant regardless of the data that I'm accessing, then this is workable. Slow, perhaps, but workable.
One final note: there needs to be some persistence mechanism for this to be truly useful. I don't think anyone is going to be terribly interested in AI programming which requires that the database or rules be recreated every time the program is run.
Join the Perlmonks Setiathome Group or just click on the the link and check out our stats.