AI::Categorizer::KnowledgeSet(3pm)  User Contributed Perl Documentation  AI::Categorizer::KnowledgeSet(3pm)



NAME
       AI::Categorizer::KnowledgeSet - Encapsulates a set of documents

SYNOPSIS
        use AI::Categorizer::KnowledgeSet;
        use AI::Categorizer::Learner::NaiveBayes;
        my $k = AI::Categorizer::KnowledgeSet->new(...parameters...);
        my $nb = AI::Categorizer::Learner::NaiveBayes->new(...parameters...);
        $nb->train(knowledge_set => $k);

DESCRIPTION
       The KnowledgeSet class provides an interface to a set of documents, a
       set of categories, and a mapping between the two.  Many parameters for
       controlling the processing of documents are managed by the KnowledgeSet
       class.

METHODS
       new()
           Creates a new KnowledgeSet and returns it.  Accepts the following
           parameters:

           load
               If a "load" parameter is present, the "load()" method will be
               invoked immediately.  If the "load" parameter is a string, it
               will be passed as the "path" parameter to "load()".  If the
               "load" parameter is a hash reference, it will represent all the
               parameters to pass to "load()".

           categories
               An optional reference to an array of Category objects repre-
               senting the complete set of categories in a KnowledgeSet.  If
               used, the "documents" parameter should also be specified.

           documents
               An optional reference to an array of Document objects repre-
               senting the complete set of documents in a KnowledgeSet.  If
               used, the "categories" parameter should also be specified.

           features_kept
               A number indicating how many features (words) should be consid-
               ered when training the Learner or categorizing new documents.
               May be specified as a positive integer (e.g. 2000) indicating
               the absolute number of features to be kept, or as a decimal
               between 0 and 1 (e.g. 0.2) indicating the fraction of the total
               number of features to be kept, or as 0 to indicate that no fea-
               ture selection should be done and that the entire set of fea-
               tures should be used.  The default is 0.2.

           feature_selection
               A string indicating the type of feature selection that should
               be performed.  Currently the only option is also the default
               option: "document_frequency".

           tfidf_weighting
               Specifies how document word counts should be converted to vec-
               tor values.  Uses the three-character specification strings
               from Salton & Buckley's paper "Term-weighting approaches in
               automatic text retrieval".  The three characters indicate the
               three factors that will be multiplied for each feature to find
               the final vector value for that feature.  The default weighting
               is "xxx".

               The first character specifies the "term frequency" component,
               which can take the following values:

               b   Binary weighting - 1 for terms present in a document, 0 for
                   terms absent.

               t   Raw term frequency - equal to the number of times a feature
                   occurs in the document.

               x   A synonym for 't'.

               n   Normalized term frequency - 0.5 + 0.5 * t/max(t).  This is
                   the same as the 't' specification, but with term frequency
                   normalized to lie between 0.5 and 1.

               The second character specifies the "collection frequency" com-
               ponent, which can take the following values:

               f   Inverse document frequency - multiply term "t"'s value by
                   "log(N/n)", where "N" is the total number of documents in
                   the collection, and "n" is the number of documents in which
                   term "t" is found.

               p   Probabilistic inverse document frequency - multiply term
                   "t"'s value by "log((N-n)/n)" (same variable meanings as
                   above).

               x   No change - multiply by 1.

               The third character specifies the "normalization" component,
               which can take the following values:

               c   Apply cosine normalization - multiply by 1/length(docu-
                   ment_vector).

               x   No change - multiply by 1.

               The three components may alternatively be specified by the
               "term_weighting", "collection_weighting", and "normal-
               ize_weighting" parameters respectively.

           verbose
               If set to a true value, some status/debugging information will
               be output on "STDOUT".
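
           The two accepted forms of the "load" parameter described above can
           be pictured with a small sketch.  The helper name
           normalize_load_args, the path 'corpus/training' and the key
           example_option are hypothetical and are not part of AI::Categorizer;
           the code only illustrates how a string becomes the "path" parameter
           while a hash reference is passed through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: turn the value of the
# "load" parameter into the argument list that load() would receive.
sub normalize_load_args {
    my ($load) = @_;
    return %$load if ref($load) eq 'HASH';   # hash ref: all parameters for load()
    return (path => $load);                  # plain string: treated as the path
}

# Both calls end up with a "path" entry; the hash form can carry more keys.
my %from_string = normalize_load_args('corpus/training');
my %from_hash   = normalize_load_args({ path => 'corpus/training', example_option => 1 });

print join(', ', map { "$_=$from_string{$_}" } sort keys %from_string), "\n";
print join(', ', map { "$_=$from_hash{$_}" }   sort keys %from_hash),   "\n";
```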

       categories()
           In a list context returns a list of all Category objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       documents()
           In a list context returns a list of all Document objects in this
           KnowledgeSet.  In a scalar context returns the number of such
           objects.

       document()
           Given a document name, returns the Document object with that name,
           or "undef" if no such Document object exists in this KnowledgeSet.

       features()
           Returns a FeatureSet object which represents the features of all
           the documents in this KnowledgeSet.

       verbose()
           Returns the "verbose" parameter of this KnowledgeSet, or sets it
           with an optional argument.

       scan_stats()
           Scans all the documents of a Collection and returns a hash refer-
           ence containing several statistics about the Collection.  (XXX need
           to describe stats)

       scan_features()
           This method scans through a Collection object and determines the
           "best" features (words) to use when loading the documents and
           training the Learner.  This process is known as "feature selec-
           tion", and it's a very important part of categorization.

           The Collection object should be specified as a "collection" parame-
           ter, or by giving the arguments to pass to the Collection's "new()"
           method.

           The process of feature selection is governed by the "feature_selec-
           tion" and "features_kept" parameters given to the KnowledgeSet's
           "new()" method.

           This method returns the features as a FeatureVector whose values
           are the "quality" of each feature, by whatever measure the "fea-
           ture_selection" parameter specifies.  Normally you won't need to
           use the return value, because this FeatureVector will become the
           "use_features" parameter of any Document objects created by this
           KnowledgeSet.
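
           As a rough illustration of "document_frequency" feature selection,
           the sketch below counts how many documents contain each word and
           keeps the most widespread ones, interpreting "features_kept" as
           described under "new()".  It is not the module's implementation:
           the helper names are hypothetical, documents are plain arrays of
           words, and two choices are assumptions of this sketch, namely that
           higher document frequency ranks higher and that fractions are
           truncated to an integer count.

```perl
use strict;
use warnings;

# Hypothetical helper: convert a features_kept value into an absolute count.
sub features_to_keep {
    my ($features_kept, $total) = @_;
    return $total if $features_kept == 0;        # 0 means no feature selection
    my $n = $features_kept < 1
        ? int($features_kept * $total)           # fraction of all features
        : $features_kept;                        # absolute number of features
    return $n > $total ? $total : $n;
}

# Hypothetical helper: $docs is an array ref of array refs of words.
# Returns a hash ref mapping each kept word to its document frequency.
sub select_by_document_frequency {
    my ($docs, $features_kept) = @_;
    my %df;
    for my $doc (@$docs) {
        my %seen;
        $df{$_}++ for grep { !$seen{$_}++ } @$doc;   # count a word once per document
    }
    # Most widespread words first; ties broken alphabetically for stable output.
    my @ranked = sort { $df{$b} <=> $df{$a} or $a cmp $b } keys %df;
    my $keep   = features_to_keep($features_kept, scalar @ranked);
    my %kept   = map { $_ => $df{$_} } @ranked[0 .. $keep - 1];
    return \%kept;
}

my $kept = select_by_document_frequency(
    [ [qw(cat dog)], [qw(cat fish)], [qw(cat dog bird)] ], 2);
print "$_ appears in $kept->{$_} documents\n" for sort keys %$kept;
```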

       save_features()
           Given the name of a file, this method writes the features (as
           determined by the "scan_features" method) to the file.

       restore_features()
           Given the name of a file written by "save_features", loads the fea-
           tures from that file and passes them as the "use_features" parame-
           ter for any Document objects created in the future by this Knowl-
           edgeSet.

       read()
           Iterates through a Collection of documents and adds them to the
           KnowledgeSet.  The Collection can be specified using a "collection"
           parameter - otherwise, specify the arguments to pass to the "new()"
           method of the Collection class.

       load()
           This method can do feature selection and load a Collection in one
           step (though it currently uses two steps internally).

       add_document()
           Given a Document object as an argument, this method will add it,
           along with any categories it belongs to, to the KnowledgeSet.

       make_document()
           This method will create a Document object with the given data and
           then call "add_document()" to add it to the KnowledgeSet.  A "cate-
           gories" parameter should specify an array reference containing a
           list of categories by name.  These are the categories that the doc-
           ument belongs to.  Any other parameters will be passed to the Docu-
           ment class's "new()" method.

       finish()
           This method will be called prior to training the Learner.  Its pur-
           pose is to perform any operations (such as feature vector weight-
           ing) that may require examination of the entire KnowledgeSet.

       weigh_features()
           This method will be called during "finish()" to adjust the weights
           of the features according to the "tfidf_weighting" parameter.
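
           The three-character "tfidf_weighting" codes can be read as three
           multiplications per feature.  The sketch below applies them to a
           single document's word counts.  It is a minimal illustration under
           stated assumptions, not the module's code: weigh_document and its
           arguments are hypothetical names, and document frequencies are
           supplied by the caller as a plain hash.

```perl
use strict;
use warnings;
use List::Util qw(max sum);

# Hypothetical helper: weight one document's term counts by a spec like "tfc".
#   $counts - hash ref: term => raw count in this document
#   $df     - hash ref: term => number of documents containing the term (n)
#   $N      - total number of documents in the collection
sub weigh_document {
    my ($spec, $counts, $df, $N) = @_;
    my ($tf, $cf, $norm) = split //, $spec;
    my $max_t = max(values %$counts) || 1;    # avoid dividing by zero

    my %weight;
    for my $term (keys %$counts) {
        my $t = $counts->{$term};
        my $n = $df->{$term};

        # First character: term frequency component.
        my $v = $tf eq 'b' ? ($t > 0 ? 1 : 0)
              : $tf eq 'n' ? 0.5 + 0.5 * $t / $max_t
              :              $t;                       # 't' or its synonym 'x'

        # Second character: collection frequency component ('x' leaves $v alone).
        if    ($cf eq 'f') { $v *= log($N / $n) }
        elsif ($cf eq 'p') { $v *= log(($N - $n) / $n) }   # log() dies if $n == $N

        $weight{$term} = $v;
    }

    # Third character: cosine normalization divides by the vector length.
    if ($norm eq 'c') {
        my $length = sqrt(sum(0, map { $_ ** 2 } values %weight));
        if ($length > 0) { $_ /= $length for values %weight }
    }
    return \%weight;
}

my $w = weigh_document('tfc', { perl => 3, module => 1 },
                              { perl => 2, module => 5 }, 10);
printf "%-6s %.3f\n", $_, $w->{$_} for sort keys %$w;
```

           With the default "xxx" every factor is neutral, so the result is
           simply the raw counts.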

       document_frequency()
           Given a single feature (word) as an argument, this method will
           return the number of documents in the KnowledgeSet that contain
           that feature.

       partition()
           Divides the KnowledgeSet into several subsets.  This may be useful
           for performing cross-validation.  The relative sizes of the subsets
           should be passed as arguments.  For example, to split the Knowl-
           edgeSet into four KnowledgeSets of equal size, pass the arguments
           .25, .25, .25 (the final size is 1 minus the sum of the other
           sizes).  The partitions will be returned as a list.
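
           The size arithmetic can be sketched on a plain list.  The helper
           partition_items is hypothetical and works on array elements rather
           than on a KnowledgeSet; rounding each size to the nearest whole item
           and taking items in their existing order are choices made for this
           sketch, not statements about how the module assigns documents.

```perl
use strict;
use warnings;

# Hypothetical helper: split the items in $items according to relative sizes.
# The last partition receives whatever remains (1 minus the sum of @sizes).
sub partition_items {
    my ($items, @sizes) = @_;
    my $total = @$items;
    my @pool  = @$items;                      # work on a copy
    my @parts;
    for my $size (@sizes) {
        my $count = int($size * $total + 0.5);    # nearest whole number of items
        $count = @pool if $count > @pool;         # never take more than is left
        push @parts, [ splice @pool, 0, $count ];
    }
    push @parts, [@pool];                     # the implicit final partition
    return @parts;
}

# Three explicit sizes of .25 yield four partitions.
my @parts = partition_items([ 1 .. 8 ], .25, .25, .25);
print scalar(@$_), " items\n" for @parts;
```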

AUTHOR
       Ken Williams, ken AT mathforum.org

COPYRIGHT
       Copyright 2000-2003 Ken Williams.  All rights reserved.

       This library is free software; you can redistribute it and/or modify it
       under the same terms as Perl itself.

SEE ALSO
       AI::Categorizer(3)



perl v5.8.7                       2002-11-24  AI::Categorizer::KnowledgeSet(3pm)
 
