PerlMonks  

Clustering/classifying recommendations

by f77coder (Beadle)
on Aug 19, 2014 at 16:30 UTC ( [id://1097999]=perlquestion )

f77coder has asked for the wisdom of the Perl Monks concerning the following question:

Hello

I'm looking for clustering recommendations that favor speed over being lightweight/small, so I'd prefer loops over one-liners if the loop executes faster. I'm going through the long list of CPAN modules (AI, Bayes, Cluster, etc.) and would like to narrow down the search. I don't mind getting the source code and hacking on it if it doesn't quite match what I need, rather than expecting something to work as is.

The input data is a mixture of integers and strings, all of it categorical. I'd like to treat each data line as an array and do vector processing on it. Think of it as a 1D image-processing problem: how many pixels differ?

For example,

line1=> cat1=123, cat2=92, cat3=5, cat4='0xffa411', cat5='0x221133', cat6='0xa291f1'

line2=> cat1=3, cat2=92, cat3=5, cat4='0xaf1401', cat5='0xaaffcc', cat6='0xa23af1'

I'd like to create a distance measurement based only on the number of categories that differ. In this case cat2 and cat3 match and the other four categories differ, so the distance map would be (cat2,cat3,4). A weighting function will probably be applied to this metric as well.
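A minimal sketch of such a metric in Perl, written as a plain loop per the stated preference. It counts mismatching positions (a Hamming-style distance) and compares values as strings so integers and hex strings are treated alike. The sub name `mismatch_distance` and the optional per-category weights array are assumptions for illustration; the post does not specify a weighting function.

```perl
use strict;
use warnings;

# Count the positions at which two equal-length category vectors differ.
# Values are compared as strings, so 92 and '0xffa411' are handled alike.
# $weights is an optional array ref (hypothetical); without it every
# mismatching category counts as 1.
sub mismatch_distance {
    my ($x, $y, $weights) = @_;
    die "vectors differ in length\n" unless @$x == @$y;
    my $dist = 0;
    for my $i (0 .. $#$x) {
        next if $x->[$i] eq $y->[$i];
        $dist += $weights ? $weights->[$i] : 1;
    }
    return $dist;
}

# The two example lines from the post, as positional arrays cat1..cat6.
my @line1 = (123, 92, 5, '0xffa411', '0x221133', '0xa291f1');
my @line2 = (  3, 92, 5, '0xaf1401', '0xaaffcc', '0xa23af1');

print mismatch_distance(\@line1, \@line2), "\n";   # prints 4
```

Passing a third argument such as `[2, 1, 1, 1, 1, 1]` would make a mismatch in cat1 count double; any real weighting scheme would have to come from the data.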

Once training is complete, I'd like to make a prediction for each new line with the classifier/clusterer.
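One way the predict-on-a-new-line step could look is a k-nearest-neighbour vote over the count of differing categories. This is only a sketch under assumptions: the post does not name an algorithm, and it does not say whether the training lines carry class labels, so the labels 'A' and 'B', the sub names, and the third training row are all made up for illustration.

```perl
use strict;
use warnings;

# Number of positions at which two equal-length category vectors differ.
sub mismatch_distance {
    my ($x, $y) = @_;
    my $dist = 0;
    for my $i (0 .. $#$x) {
        $dist++ if $x->[$i] ne $y->[$i];
    }
    return $dist;
}

# Predict a label for $point by majority vote among its $k nearest
# training rows. Each row is [ \@features, $label ]; the labels are an
# assumption, since the post does not say where classes come from.
sub knn_predict {
    my ($train, $point, $k) = @_;
    my @sorted = sort { $a->[0] <=> $b->[0] }
                 map  { [ mismatch_distance($_->[0], $point), $_->[1] ] } @$train;
    $k = @sorted if $k > @sorted;
    my %votes;
    $votes{ $_->[1] }++ for @sorted[0 .. $k - 1];
    # Ties are broken by label name so the result is deterministic.
    my ($best) = sort { $votes{$b} <=> $votes{$a} || $a cmp $b } keys %votes;
    return $best;
}

# Hypothetical training data: rows 1 and 3 reuse the example lines.
my @train = (
    [ [123, 92, 5, '0xffa411', '0x221133', '0xa291f1'], 'A' ],
    [ [123, 92, 7, '0xffa411', '0x221133', '0xa291f1'], 'A' ],
    [ [  3, 92, 5, '0xaf1401', '0xaaffcc', '0xa23af1'], 'B' ],
);

my @new_line = (3, 92, 5, '0xaf1401', '0x221133', '0xa23af1');
print knn_predict(\@train, \@new_line, 1), "\n";   # prints B
```

Each prediction scans every training row, so the cost grows linearly with the training set; for large data an index or a precomputed clustering would be needed.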

Thanks

Replies are listed 'Best First'.
Re: Clustering/classifying recommendations
by Laurent_R (Canon) on Aug 19, 2014 at 20:24 UTC
    Hmm, your requirement is not very clear to me (and probably to other monks as well, judging from the answers you've got so far), but if you want to compare lists, it seems to me that the List::Util and List::MoreUtils CPAN modules might be the first place to go.

      Essentially, think of each input line as representing a point in an N-dimensional space. I want to classify/cluster these points and need a metric to measure the separation/distance between them.

      one point = (category 1, category 2, category 3, ..., category N)

      BioPerl combined with Bayes might do the trick.
