
That's a good idea, kschwab. I can explain some things, but I have no experience with either module.

For train(), you need to pass a lot of data cases as a list of hashrefs, each one like the one you already have in your post:

{ attributes => { phone => 1, 'last name' => 1, 'fname' => 1, mobile => 1 }, labels => ['has header'] },

which means that, in this case, the predictor "phone" has a weight of 1, "last name" the same, etc., and that you, the human, classified this case as "has header".

What does a weight mean? Let's say that in your case it is the number of times the predictor occurred in that single data case. Each data case will have its own weights for each predictor. A weight can also be something else, or a combination of things, for example: the number of times the predictor occurs, whether it is capitalised, whether it is at the beginning of a sentence, etc.
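As a minimal sketch of the "number of times it occurred" idea, the cells of one row can be counted into an attributes hash. The helper name row_to_attributes and the sample cells are hypothetical, made up for illustration; lower-casing the cells is also just an assumption about how you might want to normalise them.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the cells of one row into an attributes hash,
# using the number of occurrences of each lower-cased cell as its weight.
sub row_to_attributes {
    my @cells = @_;
    my %weights;
    for my $cell (@cells) {
        next unless defined($cell) && length($cell);   # skip empty cells
        $weights{ lc $cell }++;
    }
    return \%weights;
}

# Made-up row; "phone" appears twice (once capitalised), so its weight is 2.
my $attributes = row_to_attributes('Phone', 'Last Name', 'fname', 'Mobile', 'phone');

# One data case in the shape train() expects.
my $case = { attributes => $attributes, labels => ['has header'] };

printf "%s => %d\n", $_, $attributes->{$_} for sort keys %$attributes;
```

Other weighting schemes (capitalisation, position, and so on) would just be different ways of filling the same hash.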

Then you continue with your next data case, and so on. Ideally you should represent all labels: "has header" and, I guess, "has no header". All of these go into a single list (of the hashrefs mentioned above) that is given as the parameter to train().

Then it's time to classify some unknown cases. Using the couplet:

my $result = $classifier->classify({phone => 3, fname => 0, ...});
my $best_category = $result->best_category;

$best_category will be one of "has header" or "has no header" for the particular data case you classify(). The $result object can also tell you what influence each field/predictor has, using my $predictors = $result->find_predictors; (see AI::NaiveBayes::Classification).
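Putting training and classification together, a small end-to-end sketch might look like the following. It only uses the calls shown in the AI::NaiveBayes synopsis (train, classify, best_category). The training cases, and the predictors "email", "address", "digits" and "capitalised", are invented for illustration and are not real spreadsheet data.

```perl
use strict;
use warnings;
use AI::NaiveBayes;

# Made-up training cases: two rows that look like headers, two that look like data.
my @cases = (
    { attributes => { phone => 1, 'last name' => 1, fname => 1, mobile => 1 },
      labels     => ['has header'] },
    { attributes => { phone => 1, email => 1, fname => 1, address => 1 },
      labels     => ['has header'] },
    { attributes => { digits => 2, capitalised => 2 },    # hypothetical predictors
      labels     => ['has no header'] },
    { attributes => { digits => 3, capitalised => 1 },
      labels     => ['has no header'] },
);

# train() takes the data cases as a plain list of hashrefs.
my $classifier = AI::NaiveBayes->train(@cases);

# Classify an unseen case described with the same kind of predictors.
my $result        = $classifier->classify({ phone => 1, fname => 1 });
my $best_category = $result->best_category;

print "best category: $best_category\n";
```

With only four cases this is a toy; in practice you would want many more cases for each label before trusting the answer.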

The trick is to find some predictors that you think differentiate the two labels. For example, one label has far fewer "phone" and the other has a lot. Then you have to calculate a weight for each of the predictors, or naively use the number of occurrences in each data case you have, just to start. I am not sure whether predictors with zero weight for a particular data case have to be mentioned in train(), or whether they will be inferred and set to zero if at least one data case mentions them and others do not. I think they will be inferred if they are absent from a particular data case but present in at least one other data case.
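If you would rather not rely on that inference, one way to sidestep the question is to fill in the missing predictors yourself before training. This is only a sketch: fill_missing_predictors is a hypothetical helper, the two cases are made up, and I have not checked whether explicit zeros change what the module computes compared with leaving the predictors out.

```perl
use strict;
use warnings;

# Hypothetical helper: make every case mention every predictor seen in any
# case, with an explicit weight of 0 where the predictor was absent.
sub fill_missing_predictors {
    my @cases = @_;
    my %seen;
    $seen{$_} = 1 for map { keys %{ $_->{attributes} } } @cases;
    for my $case (@cases) {
        $case->{attributes}{$_} //= 0 for keys %seen;
    }
    return @cases;
}

# Made-up cases that mention different predictors.
my @cases = fill_missing_predictors(
    { attributes => { phone => 1, fname => 1 }, labels => ['has header'] },
    { attributes => { digits => 3 },            labels => ['has no header'] },
);

for my $case (@cases) {
    my $attrs = $case->{attributes};
    print join(', ', map { "$_ => $attrs->{$_}" } sort keys %$attrs), "\n";
}
```

The filled list can then be passed to train() in place of the original one.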

Forgot to mention that a data case can belong to many labels! That's why you have that arrayref in labels => [...] (note: data case = data row = a single observation)

Code taken from AI::NaiveBayes

bw, bliako


In reply to Re^3: Module for intelligently analyzing and merging spreadsheet data by bliako
in thread Module for intelligently analyzing and merging spreadsheet data by nysus
