CROSSBOW(1)                      User Commands                     CROSSBOW(1)



NAME
       crossbow  -  a front-end with hierarchical clustering and deterministic
       annealing

SYNOPSIS
       crossbow [OPTION...] [ARG...]

DESCRIPTION
       Crossbow is a document clustering front-end to libbow. This brief manpage
       was  written  for the Debian GNU/Linux distribution since there is none
       available in the main package.

       Note that crossbow is not a supported program.

OPTIONS
              For building data structures from text files:

       --build-hier-from-dir
              When indexing a single directory, use the directory structure to
              build a class hierarchy

       -c, --cluster
              cluster the documents, and write the results to disk

       --classify
              Split the data into train/test, and classify the test data,
              outputting results in rainbow format

       --classify-files=DIRNAME
              Classify documents in DIRNAME, outputting one `filename
              classname' pair per line.

       --cluster-output-dir=DIR
              After clustering is finished, write the cluster to directory DIR

       -i, --index
              tokenize training documents found  under  ARG...,  build  weight
              vectors, and save them to disk

       --index-multiclass-list=FILE
              Index the files listed in FILE.  Each line of FILE should
              contain a filename followed by a list of classnames to which
              that file belongs.

       --print-doc-names[=TAG]
              Print the filenames of documents contained in the model.  If the
              optional TAG argument is given, print only  the  documents  that
              have the specified tag.

       --print-matrix
              Print the word/document count matrix in an awk- or perl-accessi-
              ble format.  Format is sparse and includes  the  words  and  the
              counts.

       --print-word-probabilities=FILEPREFIX
              Print  the  word  probability distribution in each leaf to files
              named FILEPREFIX-classname

       --query-server=PORTNUM
              Run crossbow in server mode, listening on socket number PORTNUM.
              You can try it by executing this command, then in a different
              shell window on the same machine typing `telnet localhost
              PORTNUM'.

       --use-vocab-in-file=FILENAME
              Limit vocabulary to just those  words  read  as  space-separated
              strings from FILENAME.
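
       The two line formats described above, the FILE read by
       --index-multiclass-list and the `filename classname' pairs written by
       --classify-files, can be handled with a few lines of Python.  This is a
       minimal sketch and not part of crossbow: both function names are
       hypothetical, and it assumes that fields are separated by whitespace
       and that names contain no spaces, which this page does not state.

```python
from typing import Dict, Iterable, List, Tuple


def format_multiclass_list(labels: Dict[str, Iterable[str]]) -> str:
    """Return one line per file: the filename, then its class names."""
    lines = []
    for filename, classnames in labels.items():
        # Assumption: whitespace-separated fields, so no spaces inside names.
        lines.append(" ".join([filename, *classnames]))
    return "\n".join(lines) + "\n"


def parse_classify_output(text: str) -> List[Tuple[str, str]]:
    """Split `filename classname' lines into (filename, classname) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or unexpected lines
            pairs.append((fields[0], fields[1]))
    return pairs
```

       The string returned by format_multiclass_list can be written to a file
       and passed as FILE; parse_classify_output expects the captured standard
       output of a --classify-files run.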

              Splitting options:

       --ignore-set=SOURCE
              How to select the ignored documents.  Same format as --test-set.
              Default is `0'.

       --set-files-use-basename[=N]
              When using files to specify doc types, compare only the last N
              components of the doc's pathname.  That is, use the filename and
              the last N-1 directory names.  If N is not specified, it
              defaults to 1.

       --test-set=SOURCE
              How  to  select the testing documents.  A number between 0 and 1
              inclusive with a decimal point indicates a  random  fraction  of
              all documents.  The number of documents selected from each class
              is determined by attempting to match the proportions of the non-
              ignore  documents.  A number with no decimal point indicates the
              number of documents to select randomly.  Alternatively, a suffix
              of `pc' indicates the number of documents per class to tag.  The
              suffix `t' on a number or proportion means to tag documents from
              the pool of training documents, not the untagged documents.
              `remaining' selects all documents that remain  untagged  at  the
              end.   Anything  else is interpreted as a filename listing docu-
              ments to select.  Default is `0.0'.

       --train-set=SOURCE
              How  to  select  the  training  documents.    Same   format   as
              --test-set.  Default is `remaining'.

       --unlabeled-set=SOURCE
              How to select the unlabeled documents.  Same format as
              --test-set.  Default is `0'.

       --validation-set=SOURCE
              How   to  select  the  validation  documents.   Same  format  as
              --test-set.  Default is `0'.
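
       The SOURCE forms accepted by --test-set, and by the options that share
       its format, can be summarized in code.  The sketch below is one reading
       of the description above, not libbow's actual parser.  SetSpec and
       interpret_source are hypothetical names, and the sketch assumes that at
       most one suffix (`pc' or `t') is given and that a decimal number
       outside the range 0 to 1 falls through to the filename case.

```python
import re
from typing import NamedTuple, Optional


class SetSpec(NamedTuple):
    kind: str                # "fraction", "count", "per-class", "remaining", "file"
    value: Optional[float]   # numeric argument, if any
    from_training: bool      # True when the `t' suffix was given
    filename: Optional[str]  # set only for kind == "file"


# A number with or without a decimal point, then an optional suffix.
_NUMERIC = re.compile(r"(\d+\.\d*|\.\d+|\d+)(pc|t)?")


def interpret_source(source: str) -> SetSpec:
    if source == "remaining":
        return SetSpec("remaining", None, False, None)
    match = _NUMERIC.fullmatch(source)
    if match is None:
        # Anything that is not a recognized number is treated as a filename.
        return SetSpec("file", None, False, source)
    number, suffix = match.group(1), match.group(2)
    if suffix == "pc":
        return SetSpec("per-class", float(number), False, None)
    from_training = suffix == "t"
    if "." in number:
        fraction = float(number)
        if not 0.0 <= fraction <= 1.0:
            # Assumption: out-of-range decimals fall back to a filename.
            return SetSpec("file", None, False, source)
        return SetSpec("fraction", fraction, from_training, None)
    return SetSpec("count", float(number), from_training, None)
```

       Under this reading the default `0.0' is a random fraction of zero,
       while the default `0' used by the other sets is a count of zero
       documents.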

              Hierarchical EM Clustering options:

       --hem-branching-factor=NUM
              Number of clusters to create.  Default is 2.

       --hem-deterministic-horizontal
              In the horizontal E-step for a document, set to zero the member-
              ship  probabilities  of  all leaves, except the one matching the
              document's filename

       --hem-garbage-collection
              Add extra /Misc/ children to every internal node of the  hierar-
              chy, and keep their local word distributions flat

       --hem-incremental-labeling
              Instead of using all unlabeled documents in the M-step, use only
              the labeled documents, and incrementally label  those  unlabeled
              documents that are most confidently classified in the E-step

       --hem-lambdas-from-validation=NUM
              Instead of setting the lambdas from the labeled/unlabeled data
              (possibly with LOO), set the lambdas using held-out validation
              data.  0<NUM<1 is the fraction of unlabeled documents just
              before EM training of the classifier begins.  Default is 0,
              which leaves this option off.

       --hem-max-num-iterations=NUM
              Do no more iterations of EM than this.

       --hem-maximum-depth=NUM
              The  hierarchy depth beyond which it will not split.  Default is
              6.

       --hem-no-loo
              Do not use leave-one-out evaluation during the E-step.

       --hem-no-shrinkage
              Use only the clusters at the leaves; do not do anything with the
              hierarchy.

       --hem-no-vertical-word-movement
              Use EM just to set the vertical priors, not to set the vertical
              word distribution; i.e. do not do `full-EM'.

       --hem-pseudo-labeled
              After using the labels to set the starting point for EM,  change
              all training documents to unlabeled, so that they can have their
              class labels re-assigned by EM.  Useful for imperfectly  labeled
              training data.

       --hem-restricted-horizontal
              In the horizontal E-step for a document, set to zero the member-
              ship probabilities of all leaves whose names are  not  found  in
              the document's filename

       --hem-split-kl-threshold=NUM
              KL  divergence value at which tree leaves will be split. Default
              is 0.2

       --hem-temperature-decay=NUM
              Temperature decay factor.  Default is 0.9.

       --hem-temperature-end=NUM
              The final value of T.  Default is 1.

       --hem-temperature-start=NUM
              The initial value of T.
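
       The three temperature options suggest an annealing schedule.  This page
       does not say how the decay factor is applied, so the sketch below
       assumes the common geometric form, in which T is multiplied by the
       decay factor after each round until it reaches the final value.  The
       function name is hypothetical, and the start value is a required
       argument because no default is documented here.

```python
from typing import List


def temperature_schedule(start: float, end: float = 1.0,
                         decay: float = 0.9) -> List[float]:
    """List the temperatures visited, from `start` down to `end`."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must lie strictly between 0 and 1")
    if start < end:
        raise ValueError("start must not be below end")
    temperatures = []
    t = start
    while t > end:
        temperatures.append(t)
        t *= decay  # assumed geometric cooling step
    temperatures.append(end)  # finish exactly at the final temperature
    return temperatures
```

       With the documented defaults for decay (0.9) and end (1), a start value
       of 2 gives eight temperatures under this assumption.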

              General options

       --annotations=FILE
              The sarray file containing annotations  for  the  files  in  the
              index

       -b, --no-backspaces
              Don't  use backspace when verbosifying progress (good for use in
              emacs)

       -d, --data-dir=DIR
              Set the  directory  in  which  to  read/write  word-vector  data
              (default=~/.<program_name>).

       --random-seed=NUM
              The  non-negative  integer  to use for seeding the random number
              generator

       --score-precision=NUM
              The number of decimal digits to print when  displaying  document
              scores

       -v, --verbosity=LEVEL
              Set amount of info printed while running (0=silent, 1=quiet,
              2=show-progress, ... 5=max)

              Lexing options

       --append-stoplist-file=FILE
              Add words in FILE to the stoplist.

       --exclude-filename=FILENAME
              When scanning directories for text files, skip files  with  name
              matching FILENAME.

       -g, --gram-size=N
              Create tokens for all 1-grams, ..., N-grams.

       -h, --skip-header
              Avoid  lexing  news/mail  headers  by scanning forward until two
              newlines.

       --istext-avoid-uuencode
              Check for uuencoded blocks before saying that the file is  text,
              and say no if there are many lines of the same length.

       --lex-pipe-command=SHELLCMD
              Pipe files through this shell command before lexing them.

       --max-num-words-per-document=N
              Only tokenize the first N words in each document.

       --no-stemming
              Do not modify lexed words with a stemming function. (usually the
              default, depending on lexer)

       --replace-stoplist-file=FILE
              Empty the default stoplist, and add space-delimited  words  from
              FILE.

       -s, --no-stoplist
              Do not toss lexed words that appear in the stoplist.

       --shortest-word=LENGTH
              Toss lexed words that are shorter than LENGTH.  Default is
              usually 2.

       -S, --use-stemming
              Modify lexed words with the `Porter' stemming function.

       --use-stoplist
              Toss  lexed  words  that  appear  in the stoplist.  (usually the
              default SMART stoplist, depending on lexer)

       --use-unknown-word
              When used in conjunction with -O or -D, captures all words  with
              occurrence counts below threshold as the `<unknown>' token

       --xxx-words-only
              Only tokenize words with `xxx' in them
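
       What -g/--gram-size asks for, tokens for every gram size from 1 to N,
       can be illustrated as follows.  This is only a sketch over an already
       tokenized word list.  The function name is hypothetical, and because
       this page does not describe how crossbow joins the words of a
       multi-word gram into one token, grams are returned as tuples.

```python
from typing import List, Tuple


def grams_up_to(words: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all 1-grams, then 2-grams, and so on up to n-grams."""
    grams = []
    for size in range(1, n + 1):
        # Slide a window of `size` consecutive words over the list.
        for i in range(len(words) - size + 1):
            grams.append(tuple(words[i:i + size]))
    return grams
```

       For the words a, b, c and N=2 this yields the three single words
       followed by the pairs (a, b) and (b, c).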

              Mutually exclusive choice of lexers

       --flex-mail
              Use a mail-specific flex lexer

       --flex-tagged
              Use a tagged flex lexer

       -H, --skip-html
              Skip HTML tokens when lexing.

       --lex-alphanum
              Use  a  special lexer that includes digits in tokens, delimiting
              tokens only by non-alphanumeric characters.

       --lex-infix-string=ARG
              Use only the characters after ARG in each word for stoplisting
              and stemming.  If a word does not contain ARG, the entire word
              is used.

       --lex-suffixing
              Use  a special lexer that adds suffixes depending on Email-style
              headers.

       --lex-white
              Use a special lexer that delimits tokens by whitespace only, and
              does not change the contents of the token at all: no
              downcasing, no stemming, no stoplist, nothing.  Ideal for use
              with an externally-written lexer interfaced to rainbow with
              --lex-pipe-command.

              Feature-selection options

       -D, --prune-vocab-by-doc-count=N
              Remove words that occur in N or fewer documents.

       -O, --prune-vocab-by-occur-count=N
              Remove words that occur fewer than N times.

       -T, --prune-vocab-by-infogain=N
              Remove all but the top N words by selecting words  with  highest
              information gain.
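
       The -D and -O thresholds differ at the boundary: -D removes words found
       in N or fewer documents, while -O removes words with fewer than N
       occurrences.  The sketch below makes that difference concrete on
       tokenized documents.  The function and parameter names are
       hypothetical, and the zero defaults are simply the sketch's
       "no pruning" values, not documented crossbow defaults.

```python
from collections import Counter
from typing import List, Set


def surviving_vocab(docs: List[List[str]], doc_count_n: int = 0,
                    occur_count_n: int = 0) -> Set[str]:
    """Return the words left after -D and -O style pruning."""
    # Total occurrences of each word across all documents.
    occurrences = Counter(word for doc in docs for word in doc)
    # Number of distinct documents each word appears in.
    doc_counts = Counter(word for doc in docs for word in set(doc))
    return {
        word for word in occurrences
        if doc_counts[word] > doc_count_n         # -D: drop if in N or fewer docs
        and occurrences[word] >= occur_count_n    # -O: drop if fewer than N times
    }
```

       A word that appears in exactly N documents is removed by -D N, whereas
       a word that occurs exactly N times is kept by -O N.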

              Weight-vector setting/scoring method options

       --binary-word-counts
              Instead  of  using  integer  occurrence  counts  of words to set
              weights, use binary absence/presence.

       --event-document-then-word-document-length=NUM
              Set the normalized length of documents when  --event-model=docu-
              ment-then-word

       --event-model=EVENTNAME
              Set  what  objects will be considered the `events' of the proba-
              bilistic model.  EVENTNAME can be one of: word, document,  docu-
              ment-then-word.

              Default is `word'.

       --infogain-event-model=EVENTNAME
              Set  what  objects will be considered the `events' when informa-
              tion gain is calculated.  EVENTNAME can be one of:  word,  docu-
              ment, document-then-word.

              Default is `document'.

       -m, --method=METHOD
              Set  the word weight-setting method; METHOD may be one of: fien-
              berg-classify,    hem-classify,     hem-cluster,     multiclass,
              default=naivebayes.

       --print-word-scores
              During  scoring,  print  the  contribution  of each word to each
              class.

       --smoothing-dirichlet-filename=FILE
              The file containing the alphas for the dirichlet smoothing.

       --smoothing-dirichlet-weight=NUM
              The weighting factor by which to multiply the alphas for
              dirichlet smoothing.

       --smoothing-goodturing-k=NUM
              Smooth word probabilities for words that occur NUM or fewer
              times.  The default is 7.

       --smoothing-method=METHOD
              Set the method for smoothing word probabilities to avoid  zeros;
              METHOD may be one of: goodturing, laplace, mestimate, wittenbell

       --uniform-class-priors
              When setting weights, calculating infogain and scoring, use
              equal prior probabilities on classes.
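
       One of the methods named for --smoothing-method is laplace.  As an
       illustration of how smoothing avoids zero word probabilities, here is
       add-one (Laplace) smoothing in its textbook form.  The code is not
       taken from libbow, whose implementation may differ in detail, and the
       function name is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def laplace_word_probabilities(class_words: Iterable[str],
                               vocabulary: Iterable[str]) -> Dict[str, float]:
    """Estimate P(word | class) with add-one smoothing."""
    vocab = set(vocabulary)
    # Count only the words that belong to the vocabulary.
    counts = Counter(word for word in class_words if word in vocab)
    # Adding 1 per vocabulary word keeps every probability above zero.
    denominator = sum(counts.values()) + len(vocab)
    return {word: (counts[word] + 1) / denominator for word in vocab}
```

       A vocabulary word never seen in the class still receives probability
       1 / (total count + vocabulary size) rather than zero.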

       -?, --help
              Give this help list

       --usage
              Give a short usage message

       -V, --version
              Print program version

       Mandatory or optional arguments to long options are also  mandatory  or
       optional for any corresponding short options.

REPORTING BUGS
       Please report bugs related to this program to Andrew McCallum
       <mccallum AT cs.edu>.  If the bugs are related to the Debian package,
       send them to submit AT bugs.org

SEE ALSO
       arrow(1),  archer(1),  rainbow(1).  The full documentation for crossbow
       will be provided as a Texinfo manual.  If the info  and  crossbow  pro-
       grams are properly installed at your site, the command

              info crossbow

       should give you access to the complete manual.

       You   can   also   find   documentation   and  updates  for  libbow  at
       http://www.cs.cmu.edu/~mccallum/bow




crossbow                         November 2002                     CROSSBOW(1)
 
