Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much

Re: Module for intelligently analyzing and merging spreadsheet data

by kschwab (Vicar)
on Feb 10, 2019 at 21:51 UTC ( #1229746=note: print w/replies, xml ) Need Help??

in reply to Module for intelligently analyzing and merging spreadsheet data

Might be neat to use a Bayesian classifier to guess which existing column the data should be merged into. Use the column names as labels, and train on the data already in the main sheet. See Algorithm::NaiveBayes for a generic Perl implementation and this node for a short example of how to use it.
  • Comment on Re: Module for intelligently analyzing and merging spreadsheet data

Replies are listed 'Best First'.
Re^2: Module for intelligently analyzing and merging spreadsheet data
by nysus (Parson) on Feb 10, 2019 at 23:45 UTC

    Thanks, yeah. I was actually looking at some of the Bayesian modules to see if they could help me determine if the spreadsheet had a header row. I couldn't quite figure out how to do it, though. And there seems to be two different Bayesian modules,the one you mentioned and this one: AI::NaiveBayes

    I'm not sure which to use or if they do anything different.

    Here is the code I was experimenting with:

    my $classifier = AI::NaiveBayes->train( { attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] }, ); # Classify a feature vector my $result = $classifier->classify({fname => 5}); # $result is now a AI::NaiveBayes::Classification object my @predictors = $result->find_predictors;

    I wasn't sure where to go from there or how to better train it.

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
    $nysus = $PM . ' ' . $MCF;
    Click here if you love Perl Monks

      That's a good idea kschwab. I can explain some things but I have no experience with either module.

      In the train() mode, you need to pass on a lot of data cases, each in the form of an array of hashrefs, each as the one you already have in your post:

      { attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] },

      which means that, in this case predictor "phone" has a weight of 1, "last name" the same, etc. And you, the human, classified this case as "has header".

      What does a weight mean? Let's say here in your case it is the number of times it occured in your single data case. Each data case will have its own weights for each predictor. Weight can be other things or a combination, for example: number of times it occurs, whether it is capitalised, whether it is at the beginning of a sentence etc.etc.

      And on you continue with your next data case. etc. Ideally you should represent all labels, "has header" and I guess, "has no header". All these in a single hash array (of the hashrefs mentioned above) to be given as parameter to train()

      Then it's time to classify some unknown cases. Using the couplet:

      my $result = $classifier->classify({phone => 3, fname => 0, ...}); my $best_category = $result->best_category;

      $best_category will be one of "has header", "has no header" for that particular data case you classify(). The classifier $result can tell you also what influence each field/predictor has using my $predictors = $result->find_predictors; (see AI::NaiveBayes::Classification)

      The trick is to find some predictors that you think differentiate the two labels. For example one has far fewer "phone" and the other has a lot. Then a weight for each of the predictors has to be calculated by you, or naively put the number of occurences in each data case you have. Just to start. I am not sure of predictors with zero weight for that particular data case have to be mentioned in train() or will be inferred and set to zero if at least one data case mentions them and others do not. I think they will be inferred if absent from particular data case but present in at least one other data case.

      Forgot to mention that a data case can belong to many labels! That's why you have that arrayref in labels => [...] (note: data case = data row = a single observation)

      Code taken from AI::NaiveBayes

      bw, bliako

Log In?

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://1229746]
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others musing on the Monastery: (3)
As of 2022-10-02 10:27 GMT
Find Nodes?
    Voting Booth?
    My preferred way to holiday/vacation is:

    Results (8 votes). Check out past polls.