AI::Categorizer::Learner::NaiveBayes--3pm

AI::Categorizer::Learner::NaiveBayes(3pm)    User Contributed Perl Documentation



NAME
       AI::Categorizer::Learner::NaiveBayes - Naive Bayes Algorithm For
       AI::Categorizer

SYNOPSIS
         use AI::Categorizer::Learner::NaiveBayes;

         # Here $k is an AI::Categorizer::KnowledgeSet object

         my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
         $nb->train(knowledge_set => $k);
         $nb->save_state('filename');

         ... time passes ...

         $nb = AI::Categorizer::Learner::NaiveBayes->restore_state('filename');
         my $c = AI::Categorizer::Collection::Files->new( path => ... );
         while (my $document = $c->next) {
           my $hypothesis = $nb->categorize($document);
           print "Best assigned category: ", $hypothesis->best_category, "\n";
           print "All assigned categories: ", join(', ', $hypothesis->categories), "\n";
         }

DESCRIPTION
       This is an implementation of the Naive Bayes decision-making algorithm,
       applied to the task of document categorization (as defined by the
       AI::Categorizer module).  See AI::Categorizer for a complete
       description of the interface.

       This module is now a wrapper around the stand-alone
       "Algorithm::NaiveBayes" module.  I moved the discussion of Bayes'
       Theorem into that module's documentation.

METHODS
       This class inherits from the "AI::Categorizer::Learner" class, so all
       of its methods are available unless explicitly mentioned here.

       new()

       Creates a new Naive Bayes Learner and returns it.  In addition to the
       parameters accepted by the "AI::Categorizer::Learner" class, the Naive
       Bayes subclass accepts the following parameters:

       * threshold
           Sets the score threshold for category membership.  The default is
           currently 0.3.  Set the threshold lower to assign more categories
           per document, or higher to assign fewer.  This can be an effective
           way to trade off between precision and recall.

       threshold()

       Returns the current threshold value.  With an optional numeric
       argument, you may set the threshold.
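
       To illustrate how the threshold affects category assignment, here is
       a minimal stand-alone sketch.  It does not call this module: the
       %scores hash, its values, and the assigned_categories() helper are
       hypothetical, and the ">=" comparison is an assumption about how a
       score is tested against the threshold.

```perl
use strict;
use warnings;

# Hypothetical per-category scores for a single document.
# These numbers are invented for illustration only.
my %scores = (sports => 0.85, politics => 0.35, business => 0.10);

# Return the categories whose score meets the threshold, best first.
# (Assumption: membership is decided with a simple ">=" comparison.)
sub assigned_categories {
    my ($threshold, %score_for) = @_;
    return sort { $score_for{$b} <=> $score_for{$a} }
           grep { $score_for{$_} >= $threshold } keys %score_for;
}

# A lower threshold admits more categories; a higher one admits fewer.
for my $threshold (0.05, 0.3, 0.5) {
    my @cats = assigned_categories($threshold, %scores);
    printf "threshold %.2f: %s\n", $threshold, join(', ', @cats);
}
```

       With these invented scores, the lowest threshold assigns all three
       categories (favoring recall) and the highest assigns only "sports"
       (favoring precision).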

       train(knowledge_set => $k)

       Trains the categorizer.  This prepares it for later use in categorizing
       documents.  The "knowledge_set" parameter must provide an object of the
       class "AI::Categorizer::KnowledgeSet" (or a subclass thereof),
       populated with lots of documents and categories.  See
       AI::Categorizer::KnowledgeSet for the details of how to create such an
       object.

       categorize($document)

       Returns an "AI::Categorizer::Hypothesis" object representing the
       categorizer's "best guess" about which categories the given document
       should be assigned to.  See AI::Categorizer::Hypothesis for more
       details on how to use this object.

       save_state($path)

       Saves the categorizer for later use.  This method is inherited from
       "AI::Categorizer::Storable".

CALCULATIONS
       The various probabilities used in the Naive Bayes calculations (see
       Algorithm::NaiveBayes) are found directly from the training documents.
       For instance, if there are 5,000 total tokens (words) in the "sports"
       training documents and 200 of them are the word "curling", then
       "P(curling|sports) = 200/5,000 = 0.04".  If there are 10,000 total
       tokens in the training corpus and 5,000 of them are in documents
       belonging to the category "sports", then
       "P(sports) = 5,000/10,000 = 0.5".
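
       The arithmetic in this example can be checked with a few lines of
       plain Perl.  This is only a sketch of the counting step using the
       figures from the text; the variable names are hypothetical and it is
       not the module's internal code.

```perl
use strict;
use warnings;

# Token counts taken from the example in the text.
my $sports_tokens  = 5000;    # tokens in the "sports" training documents
my $curling_tokens = 200;     # occurrences of "curling" among those tokens
my $corpus_tokens  = 10000;   # tokens in the whole training corpus

# Relative frequencies serve as the probability estimates.
my $p_curling_given_sports = $curling_tokens / $sports_tokens;
my $p_sports               = $sports_tokens  / $corpus_tokens;

printf "P(curling|sports) = %.2f\n", $p_curling_given_sports;   # 0.04
printf "P(sports)         = %.2f\n", $p_sports;                 # 0.50
```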

       Because the probabilities involved are often very small and we multiply
       many of them together, the result is often an extremely small number.
       This could cause floating-point underflow, so instead of working with
       the actual probabilities we work with their logarithms.  This also
       speeds up various calculations in the "categorize()" method.
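
       The underflow problem is easy to reproduce.  The sketch below
       multiplies 400 invented token probabilities directly and then sums
       their logarithms instead; the figures are hypothetical and the code is
       an illustration of the general technique, not the module's own
       implementation.

```perl
use strict;
use warnings;

# 400 hypothetical token probabilities of 0.001 each (invented values).
my @probs = (0.001) x 400;

# Direct product: the true value is 1e-1200, far below the smallest
# double-precision number, so the result underflows to zero.
my $product = 1;
$product *= $_ for @probs;

# Sum of natural logarithms: the same information, comfortably in range.
my $log_sum = 0;
$log_sum += log($_) for @probs;

print  "product = $product\n";          # prints 0 after underflow
printf "log sum = %.2f\n", $log_sum;    # roughly -2763.10
```

       Because the logarithm is monotonic, the category with the largest
       log score is also the one with the largest probability, so comparing
       sums of logarithms ranks categories the same way.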

TO DO
       More work on the confidence scores - right now the winning category
       tends to dominate the scores overwhelmingly, when the scores should
       probably be more evenly distributed.

AUTHOR
       Ken Williams, ken AT forum.edu

COPYRIGHT
       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3), Algorithm::NaiveBayes(3)

       "A re-examination of text categorization methods" by Yiming Yang
       <http://www.cs.cmu.edu/~yiming/publications.html>

       "On the Optimality of the Simple Bayesian Classifier under Zero-One
       Loss" by Pedro Domingos
       <http://www.cs.washington.edu/homes/pedrod/mlj97.ps.gz>

       A simple but complete example of Bayes' Theorem from Dr. Math
       <http://www.mathforum.com/dr.math/problems/battisfore.03.22.99.html>



perl v5.8.7                          AI::Categorizer::Learner::NaiveBayes(3pm)
 
