Bayesian not-for-spam

Kickstart has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
•Re: Bayesian not-for-spam by merlyn (Sage) on Jul 14, 2003 at 22:47 UTC
See AI::Categorizer::Learner::NaiveBayes.
-- Randal L. Schwartz, Perl hacker. Be sure to read my standard disclaimer if this is a reply.
Re: Bayesian not-for-spam by tilly (Archbishop) on Jul 14, 2003 at 23:48 UTC
You look like you might be trying to predict things like the stock market. If so, then please note that both theory and practice indicate that stock prices follow a random walk. There is no correlation between what they have done in the past and what they will do in the future. (Other than an overall tendency to climb about 10% per year. OTOH that is a trend which we are currently well above the historical average for...) Thus a computer program to predict the prices is unlikely to yield anything useful. Though our human tendency to see patterns whether or not they are there will cause you to find all sorts of spurious things if you tinker enough: spurious connections that won't hold up with future data.
Re: Re: Bayesian not-for-spam by wufnik (Friar) on Jul 15, 2003 at 10:39 UTC
hmmm, tilly, whether you accept the random walk hypothesis depends on how orthodox your economics are. Lo & MacKinlay, at MIT, in "A Non-Random Walk Down Wall Street" (1999) obtained overwhelming rejections of the random walk; there is quite compelling evidence against it. i append a quote from Niederhoffer's biography you might find interesting:
This theory and the attitude of its adherents found classic expression in one incident I personally observed that deserves memorialization. A team of four of the most respected graduate students in finance had joined forces with two professors, now considered venerable enough to have won or to have been considered for a Nobel prize, but at that time feisty as Hades and insecure as a kid on his first date. This elite group was studying the possible impact of volume on stock price movements, a subject I had researched. As I was coming down the steps from the library on the third floor of Haskell Hall, the main business building, I could see this Group of Six gathered together on a stairway landing, examining some computer output. Their voices wafted up to me, echoing off the stone walls of the building. One of the students was pointing to some output while querying the professors, "Well, what if we really do find something? We'll be up the creek. It won't be consistent with the random walk model." The younger professor replied, "Don't worry, we'll cross that bridge in the unlikely event we come to it." I could hardly believe my ears--here were six scientists openly hoping to find no departures from ignorance. I couldn't hold my tongue, and blurted out, "I sure am glad you are all keeping an open mind about your research." I could hardly refrain from grinning as I walked past them. I heard muttered imprecations in response.
respectfully, ...wufnik
-- in the world of the mules there are no rules --
Re: Re: Re: Bayesian not-for-spam by tilly (Archbishop) on Jul 16, 2003 at 21:57 UTC
Oh, I don't doubt that the random walk hypothesis has limits. Warren Buffett is enough proof of that. However, it is true enough that a casual amateur should be strongly advised not to try beating random stock movements. And it is an accurate enough approximation that, whatever the flaws, it has become established orthodox financial theory.
Re: Bayesian not-for-spam by chromatic (Archbishop) on Jul 14, 2003 at 21:50 UTC
What leads you to believe this is a question a Bayesian system can answer? As I understand it, a Bayesian system answers the question, "What is the probability this item is like either of these two opposite poles?" It's really a yes or no question. "Is this spam or ham?" If I understand you (and Bayesian filters) correctly, this is not the question you want to answer.
Re: Bayesian not-for-spam by CountZero (Bishop) on Jul 15, 2003 at 06:00 UTC
I use a Bayesian script for spam protection and it works quite well, but the technology behind it does not seem particularly suited for predicting price changes. In my Encyclopedia Britannica I read: "Bayesian estimation: statistical technique for calculating the probability of the validity of a proposition on the basis of a prior estimate of its probability and new relevant evidence." Also note that Bayesian analysis is very sensitive to the distribution of the input. The best you can hope for is to code all relevant input parameters (and until now nobody has been able to identify all the parameters which govern price changes) together with the resulting price change (increase, stable, decrease), so as to estimate the probability that a new set of input data will lead to a lower, higher or stable price. Its predictive power will be minimal, I fear, even with full knowledge of all relevant data.
CountZero
"If you have four groups working on a compiler, you'll get a 4-pass compiler." - Conway's Law
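The Britannica definition above (a prior estimate updated by new evidence) can be made concrete with a few lines of Python. This is only a sketch: the function name `posterior` and every number in it are made up for illustration, not taken from any real price data.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule for one yes/no proposition.

    prior                -- P(proposition) before seeing the evidence
    p_evidence_if_true   -- P(evidence | proposition is true)
    p_evidence_if_false  -- P(evidence | proposition is false)
    """
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / evidence


# Hypothetical numbers: we start out 50/50 on "the price rises tomorrow",
# and observe a signal seen on 60% of rising days and 40% of other days.
after_one_signal = posterior(0.5, 0.6, 0.4)

# Seeing the same kind of signal again uses the previous posterior as the
# new prior (this treats the two signals as independent, which is an assumption).
after_two_signals = posterior(after_one_signal, 0.6, 0.4)

print(after_one_signal, after_two_signals)
```

A weak signal like this one moves the estimate only a little per observation, which fits the point that the predictive power may turn out to be small.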
Re: Bayesian not-for-spam by dbp (Pilgrim) on Jul 15, 2003 at 08:12 UTC
I doubt you could do what you want to do with a system like those used for spam detection. Such systems are typically naive Bayes classifiers. They are naive because they assume all variables in the analysis are conditionally independent. For example, when comparing text a naive Bayes system assumes that each word in the text is independent of every other word. This is obviously a completely bogus assumption, but experimental results (and real-world experience) show us that textual classification doesn't suffer as a result of this assumption. This doesn't just apply in the case of spam; naive Bayes classifiers have been trained to categorize posts into newsgroups and the like (see Mitchell, section 6.10). The problem is that your problem is much more difficult than the text classification problem, and I'd expect this assumption to be much more damaging to your task. There are other Bayesian methods, such as optimal Bayes classifiers and Bayesian networks. The Mitchell book noted above gives a nice overview. Note that all these methods are based on assigning probability to different hypotheses. You have a continuous hypothesis space, which makes this difficult. You could discretize your hypothesis space by creating a finite discrete set of price ranges or by attempting to predict if the price will rise, fall, or stay the same. The other problem is that optimal classifiers and Bayesian networks are costly to train. I'm not even sure there is a polynomial time algorithm for training Bayesian networks. These techniques are closely related to maximum likelihood estimation, Markov chain Monte Carlo, and other Bayesian techniques commonly used in a variety of fields. This stuff is pretty hard-core and extremely processor-intensive. I work with a political scientist who does Bayesian analysis of the Supreme Court and his simulations can run for weeks on an openMosix cluster of high-end machines.
The simulations are written using a C++ library and I've spent a great deal of time optimizing it. This is a domain where an interpreted language like Perl simply doesn't shine. Honestly, we'd be better off in terms of speed in C (or better yet but god forbid, Fortran) but we're trying to strike a balance between efficiency and ease of use in the library. You are attempting to tackle a very hard problem. Assuming there are recognizable patterns in your data (the stock market is highly volatile but market prices of goods are a bit easier to predict), the patterns will likely be highly non-linear as a function of the inputs and your data will be incredibly noisy. Certain types of neural networks may fit your problem; they handle continuous inputs and outputs well and recurrent versions can deal with time series. Traditional econometric time-series techniques might work as well, assuming your problem ends up being at least approximately linear. Essentially, what I'm saying is that choosing a learning/classification technique is going to be your biggest problem. You may have to try a few different techniques and tweak them extensively before you get anything resembling an accurate prediction. Implementation is downright trivial in comparison.
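To show what the "naive" independence assumption looks like in practice, here is a minimal naive Bayes classifier over a discretized outcome (rise or fall). It is a sketch under stated assumptions: the features (day of week, yesterday's move), the toy rows, and the helper names `train` and `predict` are all invented for illustration, and add-one (Laplace) smoothing is one common choice rather than anything prescribed in the post.

```python
import math
from collections import Counter, defaultdict


def train(rows, labels):
    """Count how often each label occurs and how often each feature value
    occurs under each label."""
    label_counts = Counter(labels)
    # value_counts[label][feature_index][value] -> count
    value_counts = defaultdict(lambda: defaultdict(Counter))
    vocab = defaultdict(set)  # distinct values seen for each feature
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[label][i][value] += 1
            vocab[i].add(value)
    return label_counts, value_counts, vocab


def predict(model, row):
    """Return the label with the highest (log) posterior score."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, n in label_counts.items():
        score = math.log(n / total)  # log prior
        for i, value in enumerate(row):
            # The naive step: each feature contributes its own factor,
            # as if it were independent of every other feature.
            count = value_counts[label][i][value]
            # Add-one smoothing so unseen values do not zero the product.
            score += math.log((count + 1) / (n + len(vocab[i])))
        scores[label] = score
    return max(scores, key=scores.get)


# Toy, made-up history: (day of week, yesterday's move) -> today's move.
rows = [("mon", "up"), ("tue", "up"), ("wed", "down"),
        ("thu", "down"), ("fri", "up"), ("mon", "down")]
labels = ["rise", "rise", "fall", "fall", "rise", "fall"]

model = train(rows, labels)
print(predict(model, ("tue", "up")))
print(predict(model, ("wed", "down")))
```

Training here is just counting, which is why naive Bayes is cheap compared with the Bayesian networks mentioned above; the cost is that any dependence between the inputs is ignored.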
Re: Bayesian not-for-spam by hawtin (Prior) on Jul 15, 2003 at 08:02 UTC
Like others here I don't think you want Bayesian logic; it sounds like what you are trying to do is similar to predicting the weather based on today's readings. I would suggest that what you want is probably fuzzy logic. As it happens, June's Perl Journal has an article on that topic (and it's only $12 for a year, subscribe now :-) ). It mentions AI::FuzzyInference as a good CPAN module to use. I would also suggest looking at Perceptrons and neural nets. There have been all sorts of books on complexity and predicting dynamic systems. For example, I found the book "The Recursive Universe: Cosmic Complexity and the Limits of Scientific Knowledge" by William Poundstone really good.
Re: Bayesian not-for-spam by wufnik (Friar) on Jul 15, 2003 at 11:08 UTC
howdy Kickstart, merlyn's link is very useful for discrete data, and a naive bayesian network. The bayesian approach is not limited to these; more advanced bayesian nets allow causality to be modelled in a statistically rigorous and useful way. Neural Nets, mentioned above, not to mention decision trees, are alternatives here. unfortunately, there is no software for the advanced bayesian nets in perl. if you are lucky enough to possess matlab, you will find BNT, by Murphy, which is GPL'd, very useful. if the naive bayesian approach is sufficient, and it should be the start (and is by no means naive), you will find the question you face is: how do you discretize your data? day of the week etc is easy: the problems you will face will be in dealing with continuous variables like price. this discretization should not be linear if you are to make the most of the information that is there; you could also use an 'expert' to help you decide on the bands. this discretization is essential for typical bayesian net methods to work, so it is worth devoting attention to it. once you have done this, just loop through your db and determine the conditional probabilities. feed these into your naive bayesian net, and robert is your uncle. if it works, remember me in your will
...wufnik
-- in the world of the mules there are no rules --
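The two steps described above (pick non-linear bands for a continuous variable, then loop over the records to get conditional probabilities) might look like the sketch below. Quantile bands are used here as one possible non-linear scheme; that choice, the helper names, and the toy numbers are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def quantile_edges(values, bands):
    """Band edges chosen so each band holds roughly the same number of
    points, instead of splitting the price range into equal widths."""
    ordered = sorted(values)
    return [ordered[len(ordered) * k // bands] for k in range(1, bands)]


def band_of(value, edges):
    """Index of the band a value falls into, from 0 to len(edges)."""
    return bisect_right(edges, value)


def conditional_probabilities(bands, outcomes):
    """Estimate P(outcome | band) by counting co-occurrences."""
    counts = defaultdict(Counter)
    for band, outcome in zip(bands, outcomes):
        counts[band][outcome] += 1
    return {
        band: {o: c / sum(ctr.values()) for o, c in ctr.items()}
        for band, ctr in counts.items()
    }


# Made-up, heavily skewed prices: equal-width bands would put almost
# everything in the lowest band.
prices = [1, 2, 2, 3, 5, 8, 13, 40, 90]
outcomes = ["up", "up", "down", "up", "down", "down", "down", "down", "up"]

edges = quantile_edges(prices, 3)            # [3, 13]
bands = [band_of(p, edges) for p in prices]  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
table = conditional_probabilities(bands, outcomes)
print(edges, bands, table)
```

With real data the `table` would be built from the database loop, and bands suggested by an 'expert' could replace `quantile_edges` without changing the rest.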
Re: Bayesian not-for-spam by hiseldl (Priest) on Jul 15, 2003 at 20:59 UTC
If you want a predictor, you should probably use a back-propagation or feed-forward neural net; take a look at Mark Jurik's site where there are some technical reports, etc. that may help (this is not an endorsement, just a suggestion).
-- hiseldl
What time is it? It's Camel Time!
Re: Bayesian not-for-spam (re: stock Market) by AssFace (Pilgrim) on Oct 22, 2003 at 03:44 UTC
I should start off by saying that I own a company that does analysis on the stock market and provides that technical analysis to subscribers and also trades on portfolios based on that information. When I first got into Perl, it was because of an encryption problem I was working on (the Poe Cipher). With that, I learned about Markov matrices, Bayesian analysis of language, and how they related to Perl. Having been obsessed with the stock market since 4th grade, I immediately tried to think of ways that Bayesian-type analysis could be used to predict the stock market. I knew that I wouldn't be the first to have thought of it, but I wanted to resolve if it was feasible or not. If you are going to do it - you will have very little success putting information into it and having it say that N days from now the price will be X. You will have slightly better success having it say that over the next N days, the price will go up/down Y percent. And assuming you have the right code, you will have relatively decent accuracy with it saying that the market will go up/down in the next N days. I have since moved on to neural nets and genetic algorithms that mix traditional trading methods with non-linear analysis that we don't necessarily intuitively grasp on our own (most things that we consciously work out mathematically in daily life are linear). The amount of computing power that is involved is a bit much though - I use Perl and ForkManager. That then iterates over some data and feeds file names (ticker symbols) into a C program which then runs - as those build up, the cluster I have gets fed the programs to execute. I analyze thousands of days of data, thousands of times in a row with the C program and each node gets it done in about a second. There are thousands of tickers just in the US markets. So on a single processor, that is still going to take some time to churn through.
And that is still a relatively basic system that I have - I have more complex code in the works that is likely going to take 5 times as long to run. That said, I have written some more basic scripts that are actually very fast in Perl (largely due to the help of Memoize, since there is a lot of analysis on the same data over and over in loops), and they are showing that they might actually be more useful than the neural nets and the like. Do keep in mind that you don't want to have the code know everything that has happened in the past - otherwise it will tell you what it would have done back then. The stock market changes - you want it to figure out a generalized rule that it can follow that is right N% of the time (where N is sufficiently high to make you money, or rather, not lose you money) on as little data as possible - that way as times change, that rule should still work well even though the environment has changed. Also, you don't want to feed in dates. That will help somewhat in that it will learn when earnings reports are and if it is a good system, it will learn to stay out of the market at that time. But for the most part, you don't want to feed in raw data - you want to run it through some normalization functions first - squash it down. If it learns what to do when the price is 56, but then 3 years later the price is 13, the program doesn't know what to do. So you want it to analyze the numbers so that they are always in a normalized range and act accordingly on that. It is also worthwhile to determine what stocks move with or against the stock that you are looking at. If KO goes up, does PEP tend to go up too, or go down? That adds a tremendous amount of data on top of the problem of analysis. For those that say that the market is random - I would say that there are many out there that are perfectly happy making money off of what they see as non-random. People on either side stand to benefit from being right in that assertion.
I personally hope to start a hedge fund within the next ten years if they haven't gone under due to overregulation by then. Until then, I will be making money doing what I do now. (Also I will add that a neat thing you will notice is that formulas that will aid you in analysis are formulas that will work in many different disciplines - hence why so many investment banks were hiring physicists back in the early '90s - things that work in physics and heat flow work in currency trends, and work in stock market movement at monthly or 5 minute bars. The main difference to note is with weather. In weather you can make a prediction and it has no effect at all on the outcome. I could say it will rain and everyone in the area will put on a raincoat - that in itself won't make it rain or not rain. But in the stock market, depending on what analysis you are doing - some are more easily broken than others - by pointing something out and acting on it, you in effect break the system and it will behave differently.)
-------------------------------------------------------------------
There are some odd things afoot now, in the Villa Straylight.
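The normalization point above (a rule learned at a price of 56 should still apply at 13) and the "what moves with what" question can be sketched as follows. Everything here is an assumption for illustration: day-over-day returns and a tanh squash are just one way to normalize, the `scale` value is arbitrary, and the price series are made up rather than real KO or PEP data.

```python
import math


def returns(prices):
    """Fractional day-over-day changes; these do not depend on whether
    the stock trades at 56 or at 13."""
    return [(b - a) / a for a, b in zip(prices, prices[1:])]


def squash(xs, scale=0.05):
    """Map values into (-1, 1) with tanh. `scale` is an assumed 'typical'
    daily move; it would need tuning on real data."""
    return [math.tanh(x / scale) for x in xs]


def correlation(xs, ys):
    """Pearson correlation of two equal-length series.
    Fails with a division by zero if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Two made-up series at very different price levels with the same
# percentage moves (+5%, -2%, +1%).
high = [56.0, 58.8, 57.624, 58.20024]
low = [13.0, 13.65, 13.377, 13.51077]

r_high, r_low = returns(high), returns(low)
print(squash(r_high))
print(squash(r_low))
print(correlation(r_high, r_low))
```

After normalization the two series look the same to whatever model consumes them, and the correlation of returns gives a first rough answer to whether two tickers tend to move together or against each other.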

