AI::Categorizer help

by Anonymous Monk
on Mar 24, 2013 at 14:45 UTC (#1025150, perlquestion)
Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks,
I want to use AI::Categorizer for my research, but I'm not able to use it with the reuters21578 dataset. I'm trying to use the file given with the module (shown below). What exactly should I have in the test set, the training set, and the cats.txt file?
If any monks out there have used this module before, could you provide an example of working code?
Thank you.

#!/usr/bin/perl

# This script is a fairly simple demonstration of how AI::Categorizer
# can be used. There are lots of other less-simple demonstrations
# (actually, they're doing much simpler things, but are probably
# harder to follow) in the tests in the t/ subdirectory. The
# eg/categorizer script can also be a good example if you're willing
# to figure out a bit how it works.
#
# This script reads a training corpus from a directory of plain-text
# documents, trains a Naive Bayes categorizer on it, then tests the
# categorizer on a set of test documents.

use strict;
use AI::Categorizer;
use AI::Categorizer::Collection::Files;
use AI::Categorizer::Learner::NaiveBayes;
use File::Spec;

die("Usage: $0 <corpus>\n".
    " A sample corpus (data set) can be downloaded from\n".
    " +tar.gz\n".
    " or\n")
  unless @ARGV == 1;

my $corpus = shift;

my $training  = File::Spec->catfile( $corpus, 'training' );
my $test      = File::Spec->catfile( $corpus, 'test' );
my $cats      = File::Spec->catfile( $corpus, 'cats.txt' );
my $stopwords = File::Spec->catfile( $corpus, 'stopwords' );

my %params;
if (-e $stopwords) {
  $params{stopword_file} = $stopwords;
} else {
  warn "$stopwords not found - no stopwords will be used.\n";
}

if (-e $cats) {
  $params{category_file} = $cats;
} else {
  die "$cats not found - can't proceed without category information.\n";
}

# In a real-world application these Collection objects could be of any
# type (any Collection subclass). Or you could create each Document
# object manually. Or you could let the KnowledgeSet create the
# Collection objects for you.

$training = AI::Categorizer::Collection::Files->new( path => $training, %params );
$test     = AI::Categorizer::Collection::Files->new( path => $test, %params );

# We turn on verbose mode so you can watch the progress of loading &
# training. This looks nicer if you have Time::Progress installed!

print "Loading training set\n";
my $k = AI::Categorizer::KnowledgeSet->new( verbose => 1 );
$k->load( collection => $training );

print "Training categorizer\n";
my $l = AI::Categorizer::Learner::NaiveBayes->new( verbose => 1 );
$l->train( knowledge_set => $k );

print "Categorizing test set\n";
my $experiment = $l->categorize_collection( collection => $test );

print $experiment->stats_table;

# If you want to get at the specific assigned categories for a
# specific document, you can do it like this:

my $doc = AI::Categorizer::Document->new
  ( content => "Hello, I am a pretty generic document with not much to say." );

my $h = $l->categorize( $doc );

print ("For test document:\n",
       " Best category = ", $h->best_category, "\n",
       " All categories = ", join(', ', $h->categories), "\n");
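
The corpus layout that the script above reads can be sketched as follows. This is only a minimal sketch, not a confirmed answer. The script itself shows that it looks under the corpus directory for a training/ and a test/ directory of plain-text documents, a cats.txt file, and an optional stopwords file. The assumption here, based on my reading of the AI::Categorizer::Collection::Files documentation, is that each line of cats.txt holds a document's file name followed by its whitespace-separated category names. The build_corpus helper and all document names, category names, and texts are made up for illustration. Only core modules are used.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Spec;

# Hypothetical helper: writes <corpus>/training, <corpus>/test and
# <corpus>/cats.txt, and returns the path of cats.txt.
sub build_corpus {
    my ($corpus) = @_;

    # Made-up documents: file name => [ subdirectory, text, categories... ]
    my %docs = (
        'doc1.txt' => [ 'training', 'Wheat and corn exports rose this year.',  'grain' ],
        'doc2.txt' => [ 'training', 'The central bank raised interest rates.', 'interest' ],
        'doc3.txt' => [ 'test',     'Corn prices fell as interest rates rose.', 'grain', 'interest' ],
    );

    make_path( map { File::Spec->catdir( $corpus, $_ ) } qw(training test) );

    my $cats_file = File::Spec->catfile( $corpus, 'cats.txt' );
    open my $cats, '>', $cats_file or die "Cannot write $cats_file: $!\n";

    for my $name ( sort keys %docs ) {
        my ( $dir, $text, @categories ) = @{ $docs{$name} };

        # Each document is a plain-text file in training/ or test/.
        my $path = File::Spec->catfile( $corpus, $dir, $name );
        open my $fh, '>', $path or die "Cannot write $path: $!\n";
        print {$fh} "$text\n";
        close $fh or die "Cannot close $path: $!\n";

        # ASSUMPTION: one line per document, file name first, then categories.
        print {$cats} join( ' ', $name, @categories ), "\n";
    }

    close $cats or die "Cannot close $cats_file: $!\n";
    return $cats_file;
}

# Build the corpus only when a target directory is given on the command line.
build_corpus( $ARGV[0] ) if @ARGV;
```

Saved under a name such as make_corpus.pl (hypothetical), running `perl make_corpus.pl tiny-corpus` would create the directory, which could then be passed to the script above as its <corpus> argument. Three documents are far too few for meaningful statistics; the point is only the layout, with one cats.txt covering both the training and the test documents, since the script hands the same category_file to both collections.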

Replies are listed 'Best First'.
Re: AI::Categorizer help
by toolic (Bishop) on Mar 24, 2013 at 16:11 UTC

Node Type: perlquestion [id://1025150]
Approved by toolic