Be warned. As a neuroscientist, I'm in the data analysis business more than the mining variant. However, the enormous data flow that my experiments generate makes my analysis resemble mining a bit, IMHO. With this disclaimer in mind:
Perl is indeed not an analysis tool per se. It is, however, indispensable, both for its ability to handle various formats and for the short development time of your scripts. You will end up using several tools side by side:
**Use the right tool for the job!**
This is essential. Always think it through properly before you do something with a certain tool. Can this tool do the job? How much time will I have to spend learning the tool? How much time will I spend coding? How much time will I spend crunching numbers (or swapping memory space ;-)? Of course, don't spend more time than is appropriate figuring this out.
Which tools you want to use really depends on what you're going to do. For web grabbing, text manipulation, file manipulation and reporting, perl is the tool you need. If you really have to work with matrices of data (so more variables per item or more items per variable than you can handle easily), I would seriously stay away from spreadsheets. They are pretty inflexible when it comes to restating your computations or recalculating your reports/graphs. Believe me, I started that way. I couldn't drop Excel fast enough in favour of Turbo Pascal, which is itself a pale toolkit compared to perl.
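To give an idea of the kind of file manipulation and reporting I mean, here is a minimal sketch. The sample data, the column names and the `column_means` helper are all made up for illustration; the script reads a tab-separated table with one header line and reports the mean of each column.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data: tab-separated, one header line, one row per trial.
my $sample = join "\n",
    "trial\tlatency_ms\tamplitude_uV",
    "1\t12.5\t40",
    "2\t13.5\t44",
    "3\t11.0\t42",
    "";

# Return a hash ref mapping each column name to the mean of its values.
# Assumes every data row is numeric and has as many fields as the header.
sub column_means {
    my ($fh) = @_;
    chomp(my $header = <$fh>);
    my @names = split /\t/, $header;
    my @sum;
    my $rows = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;    # skip blank lines
        my @fields = split /\t/, $line;
        $sum[$_] += $fields[$_] for 0 .. $#names;
        $rows++;
    }
    return {} unless $rows;
    return { map { $names[$_] => $sum[$_] / $rows } 0 .. $#names };
}

# Read from an in-memory string so the sketch runs without any input file.
open my $fh, '<', \$sample or die "Cannot open in-memory data: $!";
my $means = column_means($fh);
close $fh;

# One line per column: name, then mean with two decimals.
printf "%-14s %8.2f\n", $_, $means->{$_} for sort keys %$means;
```

For a real data file you would open the file by name instead of the in-memory string; the rest stays the same. Once you need more than a handful of such summaries, that is the point where a proper matrix tool starts to pay off.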
Perl has PDL for basic matrix manipulation. If you want to go further, you will end up with either Matlab (www.matlab.com) or S-plus. Both have a very nice computation language with extensive statistical tools. Moreover, it's really easy to write your own statistics and to plot the results. I'm a Matlab user myself, but S-plus is equally suitable as far as I have heard.
They both have open source equivalents, octave and 'R'. I can't speak for R, but octave is a decent clone when it comes to basic Matlab stuff; for many toolboxes and for nice graphs you'll have to stick with Matlab. On the other hand, someone was posting an Inline::Octave proposal on the inline mailing list. This could be very interesting. When I start 'R' I've got myself a nice window, but I can't tell you anything about its functionality.
While I was writing the second paragraph, I got an idea, ran it through a perl/Matlab/Origin cycle and was pretty excited about the results. (Origin is my favourite graphing program.) You see, I use quite a few tools in parallel myself.
Feel free to /msg me if you want to know some more details.
HTH, Jeroen
*"We are not alone" (FZ)*
Comment on Data Mining with Perl