AI::Categorizer::Document(3pm)  User Contributed Perl Documentation  AI::Categorizer::Document(3pm)



NAME
       AI::Categorizer::Document - Embodies a document

SYNOPSIS
        use AI::Categorizer::Document;

        # Simplest way to create a document:
        my $d = new AI::Categorizer::Document(name => $string,
                                              content => $string);

        # Other parameters are accepted:
        my $d = new AI::Categorizer::Document(name => $string,
                                              categories => \@category_objects,
                                              content => { subject => $string,
                                                           body => $string2, ... },
                                              content_weights => { subject => 3,
                                                                   body => 1, ... },
                                              stopwords => \%skip_these_words,
                                              stemming => $string,
                                              front_bias => $float,
                                              use_features => $feature_vector,
                                             );

        # Specify explicit feature vector:
        my $d = new AI::Categorizer::Document(name => $string);
        $d->features( $feature_vector );

        # Now pass the document to a categorization algorithm:
        use AI::Categorizer::Learner::NaiveBayes;
        my $learner = AI::Categorizer::Learner::NaiveBayes->restore_state($path);
        my $hypothesis = $learner->categorize($d);

DESCRIPTION
       The Document class embodies the data in a single document, and contains
       methods for turning this data into a FeatureVector.  Usually documents
       are plain text, but subclasses of the Document class may handle any
       kind of data.

METHODS
       new(%parameters)
           Creates a new Document object.  Document objects are used during
           training (for the training documents), testing (for the test
           documents), and when categorizing new unseen documents in an
           application (for the unseen documents).  However, you'll typically
           call "new()" only in the last case, since the KnowledgeSet or
           Collection classes will create Document objects for you in the
           first two.

           The "new()" method accepts the following parameters:

           name
               A string that identifies this document.  Required.

           content
               The raw content of this document.  May be specified as either a
               string or as a hash reference, allowing structured document
               types.

           content_weights
               A hash reference indicating the weights that should be assigned
               to features in different sections of a structured document when
               creating its feature vector.  The weight is a multiplier of the
               feature vector values.  For instance, if a "subject" section
               has a weight of 3 and a "body" section has a weight of 1, and
               word counts are used as feature vector values, then it will be
               as if all words appearing in the "subject" appeared 3 times.

               If no weights are specified, all weights are set to 1.
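
               The weighting rule above can be sketched in plain Perl.  This
               is a minimal illustration, not the module's implementation:
               the %content hash, the whitespace tokenizer and the %features
               accumulator are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical structured document and per-section weights.
my %content = (
    subject => 'perl module release',
    body    => 'a new perl release is out',
);
my %content_weights = ( subject => 3, body => 1 );

my %features;
for my $section (sort keys %content) {
    # Sections without an explicit weight default to 1.
    my $weight = exists $content_weights{$section}
               ? $content_weights{$section}
               : 1;

    # Naive whitespace tokenizer, for illustration only.
    for my $word (split ' ', lc $content{$section}) {
        $features{$word} += $weight;   # word count multiplied by the weight
    }
}

printf "%-8s %d\n", $_, $features{$_} for sort keys %features;
```

               With these inputs, "perl" and "release" each score 4 (3 from
               the subject plus 1 from the body), while "module" scores 3.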

           front_bias
               Allows smooth bias of the weights of words in a document
               according to their position.  The value should be a number
               between -1 and 1.  Positive numbers indicate that words toward
               the beginning of the document should have higher weight than
               words toward the end of the document.  Negative numbers
               indicate the opposite.  A bias of 0 indicates that no biasing
               should be done.

           categories
               A reference to an array of Category objects that this document
               belongs to.  Optional.

           stopwords
               A list/hash of features (words) that should be ignored when
               parsing document content.  A hash reference is preferred, with
               the features as the keys.  If you pass an array reference
               containing the features, it will be converted to a hash
               reference internally.
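
               The array-to-hash conversion mentioned above is a common Perl
               idiom.  The sketch below shows the general idea with made-up
               data; the variable names are assumptions, and it does not
               reproduce the module's internal code.

```perl
use strict;
use warnings;

# A stopword list supplied as an array (hypothetical data).
my @stop_list = qw(the a of and);

# Hash form: the features are the keys, which gives constant-time lookups.
my %stopwords = map { $_ => 1 } @stop_list;

my @tokens = qw(the history of perl and the camel);
my @kept   = grep { !exists $stopwords{$_} } @tokens;

print "@kept\n";
```

               Passing the hash form directly avoids the conversion step.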

           use_features
               A Feature Vector specifying the only features that should be
               considered when parsing this document.  This is an alternative
               to using "stopwords".

           stemming
               Indicates the linguistic procedure that should be used to
               convert tokens in the document to features.  Possible values
               are "none", which indicates that the tokens should be used
               without change, or "porter", indicating that the Porter
               stemming algorithm should be applied to each token.  The
               "porter" option requires the "Lingua::Stem" module from CPAN.

           stopword_behavior
               There are a few ways you might want the stopword list
               (specified with the "stopwords" parameter) to interact with the
               stemming algorithm (specified with the "stemming" parameter).
               These options can be controlled with the "stopword_behavior"
               parameter, which can take the following values:

               no_stem
                   Match stopwords against non-stemmed document words.

               stem
                   Stem stopwords according to the "stemming" parameter, then
                   match them against stemmed document words.

               pre_stemmed
                   Stopwords are already stemmed; match them against stemmed
                   document words.

               The default value is "stem", which seems to produce the best
               results in most cases I've tried.  I'm not aware of any studies
               comparing the "no_stem" behavior to the "stem" behavior in the
               general case.

               This parameter has no effect if there are no stopwords being
               used, or if stemming is not being used.  In the latter case,
               the list of stopwords will always be matched as-is against the
               document words.

               Note that if the "stem" option is used, the data structure
               passed as the "stopwords" parameter will be modified in-place
               to contain the stemmed versions of the stopwords supplied.
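
               The three behaviors can be compared with a small stand-alone
               sketch.  Everything here is an assumption made for
               illustration: toy_stem() is a crude suffix stripper rather
               than the Porter algorithm, filter_tokens() is a hypothetical
               helper, and stemming the surviving tokens after a "no_stem"
               match is assumed rather than stated above.

```perl
use strict;
use warnings;

# Toy stemmer (NOT the Porter algorithm): strips a trailing "ing" or "s".
sub toy_stem { my $w = shift; $w =~ s/(?:ing|s)$//; return $w }

# Hypothetical helper mimicking the three documented behaviors.
sub filter_tokens {
    my ($behavior, $stopwords, @tokens) = @_;

    if ($behavior eq 'no_stem') {
        # Match raw stopwords against raw tokens, then stem the survivors.
        return map { toy_stem($_) } grep { !$stopwords->{$_} } @tokens;
    }

    if ($behavior eq 'stem') {
        # Stem the stopword list; note that the caller's hash is
        # modified in place.
        my %stemmed;
        $stemmed{ toy_stem($_) } = 1 for keys %$stopwords;
        %$stopwords = %stemmed;
    }

    # 'stem' and 'pre_stemmed': match against stemmed tokens.
    return grep { !$stopwords->{$_} } map { toy_stem($_) } @tokens;
}

my @tokens = qw(being be cats running);

my @no_stem = filter_tokens('no_stem', { being => 1 }, @tokens);

my $list = { being => 1 };
my @stem = filter_tokens('stem', $list, @tokens);

my @pre = filter_tokens('pre_stemmed', { be => 1 }, @tokens);

print "no_stem:     @no_stem\n";
print "stem:        @stem\n";
print "pre_stemmed: @pre\n";
print "stopwords after 'stem': ", join(',', sort keys %$list), "\n";
```

               In this sketch "no_stem" removes only the literal token
               "being", whereas "stem" and "pre_stemmed" also remove "be",
               because both words reduce to the same stem.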

       read( path => $path )
           An alternative constructor method which reads a file on disk and
           returns a document with that file's contents.

       parse( content => $content )
       name()
           Returns this document's "name" property as specified when the
           document was created.

       features()
           Returns the Feature Vector associated with this document.

       categories()
           In a list context, returns a list of Category objects to which this
           document belongs.  In a scalar context, returns the number of such
           categories.
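
           The context-sensitive return value can be illustrated with a
           stand-in class.  Toy::Document below is hypothetical and stores
           plain strings instead of Category objects; it only shows how a
           method can answer differently in list and scalar context.

```perl
use strict;
use warnings;

package Toy::Document;   # hypothetical stand-in, not the real class

sub new {
    my ($class, %args) = @_;
    return bless { categories => $args{categories} || [] }, $class;
}

sub categories {
    my $self = shift;
    my @cats = @{ $self->{categories} };
    # wantarray is true when the caller expects a list.
    return wantarray ? @cats : scalar @cats;
}

package main;

my $d = Toy::Document->new(categories => [qw(sports politics)]);

my @cats  = $d->categories;   # list context: the categories themselves
my $count = $d->categories;   # scalar context: how many there are

print "@cats ($count)\n";
```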

       create_feature_vector()
           Creates this document's Feature Vector by parsing its content.  You
           won't call this method directly; it's called by "new()".

AUTHOR
       Ken Williams <ken AT mathforum.org>

COPYRIGHT
       This distribution is free software; you can redistribute it and/or
       modify it under the same terms as Perl itself.  These terms apply to
       every file in the distribution - if you have questions, please contact
       the author.



perl v5.8.7                       2002-11-24    AI::Categorizer::Document(3pm)
 
