Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??
Here's an example which uses the CPAN subject categories for the training set, and then classifies modules according to which category they probably best fit into:
use strict; use warnings; require AI::Categorizer; require AI::Categorizer::Learner::NaiveBayes; require AI::Categorizer::Document; require AI::Categorizer::KnowledgeSet; require Lingua::StopWords; # set up features: # - give different weights to subjects and bodies # - use stop words my %features = (content_weights => {subject => 2, body => 1}, stopwords => Lingua::StopWords::getStopWords('en'), stemming => 'porter', ); # this is the raw data to train with, which associates # numerical categories with subjects and bodies my $chaps = { 6 => {subject => q{Data Type Utilities}, body => q{Date Time Math List Tree Algorithm Sort}, }, 10 => {subject => q{File Names Systems Locking}, body => q{Directory Dir Stat cwd}, }, 12 => {subject => q{Opt Arg Param Proc}, body => q{Option Argument Argv Config Getopt}, }, 14 => {subject => q{Security and Encryption}, body => q{Authentication Crypt Digest PGP Des}, }, 15 => {subject => q{World Wide Web HTML HTTP CGI}, body => q{WWW Apache MIME Kwiki URI URL}, }, 17 => {subject => q{Archiving and Compression}, body => q{tar gzip gz zip bzip}, }, 18 => {subject => q{Images Pixmaps Bitmaps}, body => q{Chart Graphic}, }, 19 => {subject => q{Mail and Usenet News}, body => q{Sendmail NNTP SMTP IMAP POP3 MIME}, }, }; # create documents from $chaps to train with my $docs; foreach my $cat(keys %$chaps) { $docs->{$cat} = {categories => [$cat], content => {subject => $chaps->{$cat}->{subject}, body => $chaps->{$cat}->{body}, }, }; } my $c = AI::Categorizer->new( knowledge_set => AI::Categorizer::KnowledgeSet->new( name => 'CSL'), verbose => 1, ); while (my ($name, $data) = each %$docs) { $c->knowledge_set->make_document(name => $name, %$data, %features); } my $learner = $c->learner; $learner->train; # this is a test data set to categorize, # based on the training done above my $test_set = {'Math::Complex' => {content => {subject => q{Math}, body => q{Complex number data type} } }, 'Archive::Zip' => {content => {subject => q{Compression}, body => q{Interface to ZIP archive files} } }, 'Apache2::URI' => {content => {subject => q{Apache}, body => q{Perl API for manipulating URIs} } }, 'MIME::Lite' => {content => {subject => q{Mail}, body => q{Create MIME/SMTP mails w/attachements} } }, }; # see what category each element of $test_set gets put into, # using a threshold score of 0.9 my $threshold = 0.9; while (my ($name, $data) = each %$test_set) { my $doc = AI::Categorizer::Document->new(name => $name, content => $data->{content}, %features); my $r = $learner->categorize($doc); $r->threshold($threshold); my $b = $r->best_category; next unless $r->in_category($b); printf("%s is in category %d, with score %.3f\n", $name, $b, $r->scores($b)); }
This produces
Archive::Zip is in category 17, with score 0.998 Apache2::URI is in category 15, with score 0.917 MIME::Lite is in category 19, with score 1.000 Math::Complex is in category 6, with score 0.997

In reply to Re: help with AI::Categorizer by randyk
in thread help with AI::Categorizer by downer

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others romping around the Monastery: (7)
As of 2024-03-28 10:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found