PerlMonks  

segmentation and grouping

by vkkan (Initiate)
on Dec 18, 2012 at 06:48 UTC ( [id://1009288]=perlquestion )

vkkan has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,
I am fairly new to Perl, so please forgive me if this is a naive question.
I have installed Strawberry Perl with PDL on a Windows XP box, and I am trying to do the task described below.
Task: group customers based on the services they have taken from us.
Data: a MySQL table with 150K records (it will grow exponentially) with the columns name, phone, amountpaid, service, dateofaction.
Output: all I want is to flag each customer with a group name.
The service column is a free-text field, and the groups have to be formed from its contents. For example: if you find X instances of "Y" description in the billing field in the past "Z" months, then add the customer to group "A".
While googling I found that cluster analysis in Perl might do the trick. Can someone point me in the right direction to learn about it and how to do this task?
I highly appreciate all of your help on this.
Regards,
Vijay

Replies are listed 'Best First'.
Re: segmentation and grouping
by CountZero (Bishop) on Dec 18, 2012 at 07:38 UTC
    I think you are on the wrong track. I doubt that "cluster analysis" will help you.

    Do you still have to analyse the type of services rendered to the client and decide in which group they belong? Or is the "service" field already filled in with the group they belong to? If that is the case then a simple SQL query will be enough.

    Perhaps you can show a few sample lines of your data so we better understand what you want to do and what data is available.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
      Given below is a sample data set:
      custid  Name     service               Price   Posted Date
      31      John     Consultation Charges  100     4/1/2012 10:39
      805     Kennedy  Consultation Charges  150     4/1/2012 11:17
      805     Kennedy  C-Reactive Protein    170     4/1/2012 11:56
      805     Kennedy  Complete Blood Count  150     4/1/2012 11:56
      805     Kennedy  Malarial              175     4/1/2012 11:56
      805     Kennedy  Mantoux Test          100     4/1/2012 11:56
      805     Kennedy  AZIBACT 1 MG SYP      28      4/1/2012 13:27
      805     Kennedy  FALCINILLE DRY SYP    105.15  4/1/2012 13:27
      891     Ruth     Consultation Charges  150     4/1/2012 12:05
      891     Ruth     C-Reactive Protein    170     4/1/2012 12:47
      891     Ruth     Complete Blood Count  150     4/1/2012 12:47
      891     Ruth     Mantoux Test          100     4/1/2012 12:47
      891     Ruth     X-Ray Chest           150     4/1/2012 12:47


      The service field is not filled with a group name; it only holds the service that was rendered. So from the sample above, all three people can go into a consultation group, Kennedy can go into a malarial group, and so on. I hope I have provided the needed information. Thanks for your time, CountZero.

        Maybe I'm misunderstanding you, but is this what you want?

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        my %categs;

        # Key: first word of the service ($_->[2]); value: customer names ($_->[1]).
        # The loop ends when <DATA> is exhausted and split returns an empty list.
        push @{ $categs{ $_->[2] } }, $_->[1] while @{ $_ = [ split ' ', <DATA> ] };

        pp \%categs;

        __DATA__
        31 John Consultation Charges 100 4/1/2012 10:39
        805 Kennedy Consultation Charges 150 4/1/2012 11:17
        805 Kennedy C-Reactive Protein 170 4/1/2012 11:56
        805 Kennedy Complete Blood Count 150 4/1/2012 11:56
        805 Kennedy Malarial 175 4/1/2012 11:56
        805 Kennedy Mantoux Test 100 4/1/2012 11:56
        805 Kennedy AZIBACT 1 MG SYP 28 4/1/2012 13:27
        805 Kennedy FALCINILLE DRY SYP 105.15 4/1/2012 13:27
        891 Ruth Consultation Charges 150 4/1/2012 12:05
        891 Ruth C-Reactive Protein 170 4/1/2012 12:47
        891 Ruth Complete Blood Count 150 4/1/2012 12:47
        891 Ruth Mantoux Test 100 4/1/2012 12:47
        891 Ruth X-Ray Chest 150 4/1/2012 12:47

        Producing:

        C:\test>junk59
        {
          AZIBACT => ["Kennedy"],
          "C-Reactive" => ["Kennedy", "Ruth"],
          Complete => ["Kennedy", "Ruth"],
          Consultation => ["John", "Kennedy", "Ruth"],
          FALCINILLE => ["Kennedy"],
          Malarial => ["Kennedy"],
          Mantoux => ["Kennedy", "Ruth"],
          "X-Ray" => ["Ruth"],
        }
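
The snippet above keys each group on the first word of the service only ("Complete" rather than "Complete Blood Count"). If the full service text is wanted as the key, a possible variation is sketched below. It assumes every line has the layout custid, one-word name, free-text service, price, date, time, as in the sample; the regex and the helper name group_by_service are illustrative, not part of the original reply.

```perl
use strict;
use warnings;

# Assumed layout: custid, one-word name, free-text service, price, date, time.
my $line_re = qr{^\s*(\d+)\s+(\S+)\s+(.+?)\s+([\d.]+)\s+(\d+/\d+/\d+)\s+(\d+:\d+)\s*$};

# Hypothetical helper: returns { service => [ sorted unique customer names ] }.
sub group_by_service {
    my @lines = @_;
    my %seen;    # service => { name => 1 }, so repeated visits count once
    for my $line (@lines) {
        my ( $id, $name, $service ) = $line =~ $line_re
            or next;    # skip lines that do not fit the assumed layout
        $seen{$service}{$name} = 1;
    }
    my %groups;
    for my $service ( keys %seen ) {
        $groups{$service} = [ sort keys %{ $seen{$service} } ];
    }
    return \%groups;
}

my $groups = group_by_service(
    '31 John Consultation Charges 100 4/1/2012 10:39',
    '805 Kennedy Consultation Charges 150 4/1/2012 11:17',
    '805 Kennedy AZIBACT 1 MG SYP 28 4/1/2012 13:27',
    '805 Kennedy Complete Blood Count 150 4/1/2012 11:56',
    '891 Ruth Complete Blood Count 150 4/1/2012 12:47',
);

for my $service ( sort keys %$groups ) {
    print "$service: @{ $groups->{$service} }\n";
}
```

The non-greedy `(.+?)` lets the service text contain digits (as in "AZIBACT 1 MG SYP") because the price is only accepted when a date and time follow it.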

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        RIP Neil Armstrong

        I see. Your file is just a list of services rendered and you must "cluster" these into different categories. It is possible to do so, but it will take some work.

        Do you have some kind of "dictionary" which tells you into which category or categories each type of service belongs? If so, then you just have to read each service and check it against the dictionary to find out into which category or categories each service belongs. Once you have done that, you check the number and type of categories for each client and put that info in some kind of "scoring" formula to find the most appropriate category.
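
A minimal sketch of the dictionary approach described above. The dictionary contents, the category names, and the "highest count wins" scoring are all assumptions made for illustration; a real dictionary would have to come from whoever knows the services, and the records would come from the MySQL table rather than a literal list.

```perl
use strict;
use warnings;

# Hypothetical services-to-categories dictionary (a service may map to several).
my %category_of = (
    'Consultation Charges' => ['consultation'],
    'C-Reactive Protein'   => ['lab'],
    'Complete Blood Count' => ['lab'],
    'Malarial'             => [ 'lab', 'malaria' ],
    'Mantoux Test'         => ['lab'],
    'X-Ray Chest'          => ['imaging'],
);

# Count, per customer, how often each category occurs.
# $records is a list of [custid, service] pairs.
sub score_customers {
    my ($records) = @_;
    my %score;    # custid => { category => count }
    for my $rec (@$records) {
        my ( $custid, $service ) = @$rec;
        my $cats = $category_of{$service} or next;    # unknown service: skip
        $score{$custid}{$_}++ for @$cats;
    }
    return \%score;
}

# Simplest possible scoring formula: the category with the highest count,
# ties broken alphabetically so the result is deterministic.
sub best_category {
    my ($counts) = @_;
    my ($best) = sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts;
    return $best;
}

my @records = (
    [ 31,  'Consultation Charges' ],
    [ 805, 'Consultation Charges' ],
    [ 805, 'C-Reactive Protein' ],
    [ 805, 'Complete Blood Count' ],
    [ 805, 'Malarial' ],
    [ 891, 'Consultation Charges' ],
    [ 891, 'X-Ray Chest' ],
);

my $score = score_customers( \@records );
for my $custid ( sort { $a <=> $b } keys %$score ) {
    printf "%d => %s\n", $custid, best_category( $score->{$custid} );
}
```

A rule of the form "X instances in the past Z months" could be layered on top by filtering the records on date before counting and comparing each count against X instead of taking the maximum.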

        If you do not have a "services-to-categories" dictionary then things become much more difficult and I really do not have a good and simple solution. I once applied Bayesian statistics on a similar problem (though only with a few broad categories to put the records in) and it worked "somewhat". I got about 80% correct categorizations (and thus 20% totally wrong), but it was enough for my purpose. If I trained the algorithm a bit more I might have gotten better results. Modules such as Algorithm::NaiveBayes or AI::Categorizer::Learner::NaiveBayes are worth taking a look at.
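
To give an idea of what the Bayesian route involves, here is a small hand-rolled multinomial naive Bayes classifier in core Perl. It is only a sketch and is not the API of Algorithm::NaiveBayes or AI::Categorizer; the function names, the training examples, and the labels ("lab", "pharmacy", "consultation") are made up for illustration.

```perl
use strict;
use warnings;

my ( %word_count, %label_total, %label_docs, %vocab );
my $docs = 0;

# Split a service description into lower-case word tokens.
sub tokens {
    my ($text) = @_;
    return grep { length $_ } split /[^a-z0-9]+/, lc $text;
}

# Record one labelled example.
sub train {
    my ( $text, $label ) = @_;
    $docs++;
    $label_docs{$label}++;
    for my $w ( tokens($text) ) {
        $word_count{$label}{$w}++;
        $label_total{$label}++;
        $vocab{$w} = 1;
    }
}

# Return the label with the highest log-probability, using add-one
# (Laplace) smoothing so unseen words do not zero out a label.
sub classify {
    my ($text) = @_;
    my $v = keys %vocab;    # vocabulary size
    my ( $best, $best_score );
    for my $label ( sort keys %label_docs ) {
        my $score = log( $label_docs{$label} / $docs );    # prior
        for my $w ( tokens($text) ) {
            my $n = $word_count{$label}{$w} // 0;
            $score += log( ( $n + 1 ) / ( ( $label_total{$label} // 0 ) + $v ) );
        }
        ( $best, $best_score ) = ( $label, $score )
            if !defined $best_score || $score > $best_score;
    }
    return $best;
}

# Hypothetical hand-labelled training set built from the sample services.
train( 'Complete Blood Count', 'lab' );
train( 'C-Reactive Protein',   'lab' );
train( 'Mantoux Test',         'lab' );
train( 'AZIBACT 1 MG SYP',     'pharmacy' );
train( 'FALCINILLE DRY SYP',   'pharmacy' );
train( 'Consultation Charges', 'consultation' );

print classify('AZIBACT DRY SYP'), "\n";
print classify('Blood Test'),      "\n";
```

With so few training examples the result is only as good as the word overlap between services, which is in line with the "somewhat" accuracy mentioned above; more labelled examples per category would be needed for real data.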

        CountZero

