Comrite Unix Man page/Perldoc/Info page, English-Chinese Dictionary, Chinese-English Dictionary

aspseek-sql

Command: man perldoc info search(apropos)  


 
aspseek-sql(5)               ASPseek user's guide               aspseek-sql(5)



NAME
       aspseek-sql - the structure of SQL database tables used by ASPseek

SQL TABLES
   wordurl
       This  table  keeps  information  about  each word in main and real-time
       database, one record per word.

       +-----------------+----------------------------------------------------+
       |  Field          |    Description                                     |
       +-----------------+----------------------------------------------------+
       +-----------------+----------------------------------------------------+
       |  word           |    Word itself.                                    |
       +-----------------+----------------------------------------------------+
       |  word_id        |    Numeric ID of word.                             |
       +-----------------+----------------------------------------------------+
       |  urls           |    Information about sites and urls,  in  which    |
       |                 |    word  is encountered.  Empty if size of info    |
       |                 |    is greater than 1000  bytes,  in  this  case    |
       |                 |    info is stored in separate file.                |
       +-----------------+----------------------------------------------------+
       |  urlcount       |    Number of URLs in which word is encountered.    |
       +-----------------+----------------------------------------------------+
       |  totalcount     |    Total count of this word in all URLs.           |
       +-----------------+----------------------------------------------------+
       Last 3 fields are used only if CompactStorage is set to no, and updated
       after finishing of crawling, or then index(1) is run with -D option.

   wordurl1
       This table keeps all information about each word in real-time database,
       one record per word.

       +-----------------+----------------------------------------------------+
       |  Field          |    Description                                     |
       +-----------------+----------------------------------------------------+
       +-----------------+----------------------------------------------------+
       |  word           |    Word itself.                                    |
       +-----------------+----------------------------------------------------+
       |  word_id        |    Numeric ID of word, refers to wordurl.word.     |
       +-----------------+----------------------------------------------------+
       |  urls           |    Information  about  sites  and urls in which    |
       |                 |    word  is  encountered.   Always  not   empty    |
       |                 |    regardless of size.                             |
       +-----------------+----------------------------------------------------+
       |  urlcount       |    Number of URLs in which word is encountered.    |
       +-----------------+----------------------------------------------------+
       |  totalcount     |    Total count of this word in all URLs.           |
       +-----------------+----------------------------------------------------+
       Last  3  fields are updated immediately after downloading of the URL by
       index(1) when it is run with -T option.

   urlword
       This table keeps information about all encountered URLs,  both  indexed
       and  not  indexed yet which match specified conditions in configuration
       files.

       +----------------------+-----------------------------------------------+
       |  Field               |    Description                                |
       +----------------------+-----------------------------------------------+
       +----------------------+-----------------------------------------------+
       |  url_id              |    ID of URL.                                 |
       +----------------------+-----------------------------------------------+
       |  site_id             |    ID of site, refers to sites.site_id.       |
       +----------------------+-----------------------------------------------+
       |  deleted             |    Set to 1 if server  returned  404  error   |
       |                      |    and  DeleteBad  is  set  to  yes,  or if   |
       |                      |    robots.txt or configuration rules disal-   |
       |                      |    low to index this URL.                     |
       +----------------------+-----------------------------------------------+
       |  url                 |    URL itself.                                |
       +----------------------+-----------------------------------------------+
       |  next_index_time     |    Time  of  next  indexing in seconds from   |
       |                      |    UNIX epoch.                                |
       +----------------------+-----------------------------------------------+
       |  status              |    HTTP status returned by server or  0  if   |
       |                      |    document has not been indexed yet.         |
       +----------------------+-----------------------------------------------+
       |  crc                 |    MD5 checksum of document.                  |
       +----------------------+-----------------------------------------------+
       |  last_modified       |    "Last-Modified"  HTTP header returned by   |
       |                      |    HTTP server.                               |
       +----------------------+-----------------------------------------------+
       |  etag                |    "ETag" header returned by HTTP server.     |
       +----------------------+-----------------------------------------------+
       |  last_index_time     |    Time of last indexing  in  seconds  from   |
       |                      |    UNIX epoch.                                |
       +----------------------+-----------------------------------------------+
       |  referrer            |    ID of URL which first referred this URL.   |
       +----------------------+-----------------------------------------------+
       |  tag                 |    Arbitrary tag.                             |
       +----------------------+-----------------------------------------------+
       |  hops                |    Depth of URL in hyperlink tree.            |
       +----------------------+-----------------------------------------------+
       |  redir               |    URL  ID, where current URL is redirected   |
       |                      |    or 0 if this URL is not redirected.        |
       +----------------------+-----------------------------------------------+
       |  origin              |    URL ID of document which  is  origin  of   |
       |                      |    this cloned document, or zero if this is   |
       |                      |    not clone.                                 |
       +----------------------+-----------------------------------------------+
   urlwordsNN (where NN is 2-digit number from 00-15)
       These tables contain additional info about existing indexed URLs.  Num-
       ber NN in table name is URL_ID mod 16.

       +--------------------+-------------------------------------------------+
       |   Field            |     Description                                 |
       +--------------------+-------------------------------------------------+
       +--------------------+-------------------------------------------------+
       |   deleted          |     Set  to  1  if server returned 404 error    |
       |                    |     and DeleteBad  is  set  to  yes,  or  if    |
       |                    |     robots.txt or configuration rules disal-    |
       |                    |     low to index this URL.                      |
       +--------------------+-------------------------------------------------+
       |   wordcount        |     Count of unique  words  in  the  indexed    |
       |                    |     part of URL.                                |
       +--------------------+-------------------------------------------------+
       |   totalcount       |     Total count of words in the indexed part    |
       |                    |     of URL.                                     |
       +--------------------+-------------------------------------------------+
       |   content_type     |     Content-Type  HTTP  header  returned  by    |
       |                    |     server.                                     |
       +--------------------+-------------------------------------------------+
       |   charset          |     Document charset taken from Content-Type    |
       |                    |     HTTP header or META.                        |
       +--------------------+-------------------------------------------------+
       |   title            |     First 128 characters from pages title.      |
       +--------------------+-------------------------------------------------+
       |   txt              |     First 255  characters  from  page  body,    |
       |                    |     stripped from HTML tags.                    |
       +--------------------+-------------------------------------------------+
       |   docsize          |     Total size of URL.                          |
       +--------------------+-------------------------------------------------+
       |   keywords         |     First 255 characters from page keywords.    |
       +--------------------+-------------------------------------------------+
       |   description      |     First  100 characters from page descrip-    |
       |                    |     tion.                                       |
       +--------------------+-------------------------------------------------+
       |   lang             |     Not used now.                               |
       +--------------------+-------------------------------------------------+
       |   words            |     Zipped content of URL.                      |
       +--------------------+-------------------------------------------------+
       |   hrefs            |     Sorted array of outgoing href  IDs  from    |
       |                    |     this URL.                                   |
       +--------------------+-------------------------------------------------+
   robots
       This  table  contains  information parsed from robots.txt file for each
       site.

       +-----------------------+----------------------------------------------+
       |      Field            |         Description                          |
       +-----------------------+----------------------------------------------+
       +-----------------------+----------------------------------------------+
       |      hostinfo         |         Host name.                           |
       +-----------------------+----------------------------------------------+
       |      path             |         Path to exclude from indexing.       |
       +-----------------------+----------------------------------------------+
   sites
       This table contains IDs for all indexed sites.

       +-------------------------+--------------------------------------------+
       |       Field             |          Description                       |
       +-------------------------+--------------------------------------------+
       +-------------------------+--------------------------------------------+
       |       site_id           |          ID of site.                       |
       +-------------------------+--------------------------------------------+
       |       site              |          Site name  with  protocol,        |
       |                         |          like http://www.my.com/.          |
       +-------------------------+--------------------------------------------+
   stat
       This  table  contains  information about query statistics for each com-
       pleted query.

       +-------------------+--------------------------------------------------+
       |    Field          |       Description                                |
       +-------------------+--------------------------------------------------+
       +-------------------+--------------------------------------------------+
       |    addr           |       IP address  of  computer,  from  which     |
       |                   |       query was requested.                       |
       +-------------------+--------------------------------------------------+
       |    proxy          |       IP  address  of  proxy server, through     |
       |                   |       which query was requested.                 |
       +-------------------+--------------------------------------------------+
       |    query          |       Query string.                              |
       +-------------------+--------------------------------------------------+
       |    ul             |       URL limit used to restrict the query.      |
       +-------------------+--------------------------------------------------+
       |    sp             |       Web spaces used to restrict the query.     |
       +-------------------+--------------------------------------------------+
       |    site           |       Site ID used to restrict the query.        |
       +-------------------+--------------------------------------------------+
       |    np             |       Results page number requested.             |
       +-------------------+--------------------------------------------------+
       |    ps             |       Results per page.                          |
       +-------------------+--------------------------------------------------+
       |    sites          |       Number of found sites matching query.      |
       +-------------------+--------------------------------------------------+
       |    urls           |       Number of found URLs matching query.       |
       +-------------------+--------------------------------------------------+
       |    start          |       Query processing start in seconds from     |
       |                   |       UNIX epoch.                                |
       +-------------------+--------------------------------------------------+
       |    finish         |       Query  processing  finish  in  seconds     |
       |                   |       from UNIX epoch.                           |
       +-------------------+--------------------------------------------------+
       |    referer        |       URL of web page from which  query  was     |
       |                   |       requested.                                 |
       +-------------------+--------------------------------------------------+
   subsets
       Table describing all subsets, which can be used to restrict the search.
       Populated manually with URL masks. Subset is the set of URLs  from  the
       particular  directory  of  site. Putting masks describing whole site is
       not necessary.

       +-------------------------+--------------------------------------------+
       |      Field              |          Description                       |
       +-------------------------+--------------------------------------------+
       +-------------------------+--------------------------------------------+
       |      subset_id          |          ID of subset.                     |
       +-------------------------+--------------------------------------------+
       |      mask               |          URL     mask.     Example:        |
       |                         |          http://www.my.com/dir/%.          |
       |                         |          Examples  of  wrong   use:        |
       |                         |          http://www.aspstreet.com/%,       |
       |                         |          http://www.aspstreet/%.           |
       +-------------------------+--------------------------------------------+
   spaces
       Table describing web spaces. Web space is the set of sites.  Each  site
       belonging  to particular space must be put into separate record.  Popu-
       lated manually or using -A option of index. If populated manually,  run
       index -B after changing this table.

       +-------------------------+--------------------------------------------+
       |      Field              |          Description                       |
       +-------------------------+--------------------------------------------+
       +-------------------------+--------------------------------------------+
       |      space_id           |          ID of web space.                  |
       +-------------------------+--------------------------------------------+
       |      site_id            |          ID  of  site  belonging to        |
       |                         |          the   space,   refers   to        |
       |                         |          sites.site_id.                    |
       +-------------------------+--------------------------------------------+
   tmpurl
       Table  describing  URLs  indexed since start of last indexing. Used for
       debugging.

       +-----------------+----------------------------------------------------+
       |    Field        |      Description                                   |
       +-----------------+----------------------------------------------------+
       +-----------------+----------------------------------------------------+
       |    url_id       |      URL ID.                                       |
       +-----------------+----------------------------------------------------+
       |    thread       |      Ordinal thread number, which indexed URL.     |
       +-----------------+----------------------------------------------------+
   wordsite
       Auxiliary table used when search is restricted to site pattern.   Built
       at the end of indexing from sites table.

       +-------------------+--------------------------------------------------+
       |     Field         |        Description                               |
       +-------------------+--------------------------------------------------+
       +-------------------+--------------------------------------------------+
       |     word          |        Word used in site name between dots.      |
       +-------------------+--------------------------------------------------+
       |     sites         |        Array  of  site IDs, where this word      |
       |                   |        is encountered.                           |
       +-------------------+--------------------------------------------------+
   citation
       This table contains reverse index of hyperlinks. It is used only if In-
       crementalCitations is set to no.

       +--------------------------+-------------------------------------------+
       |      Field               |          Description                      |
       +--------------------------+-------------------------------------------+
       +--------------------------+-------------------------------------------+
       |      url_id              |          URL ID.                          |
       +--------------------------+-------------------------------------------+
       |      referrers           |          Array  of  URL  IDs, which       |
       |                          |          have  hyperlink  to   this       |
       |                          |          URL.                             |
       +--------------------------+-------------------------------------------+
BLOBS
   wordurl.urls, wordurl1.urls
       +----------------------------------------------------------------------+
       | Sites information, ordered by site_id.                               |
       +----------------+-------------+---------------------------------------+
       | Offset         | Length      |  Description                          |
       +----------------+-------------+---------------------------------------+
       +----------------+-------------+---------------------------------------+
       |          0     |     4       |  Offset of URL info for 1st site.     |
       +----------------+-------------+---------------------------------------+
       |          4     |     4       |  ID of 1st site where word is en-     |
       |                |             |  countered.                           |
       +----------------+-------------+---------------------------------------+
       |          8     |     4       |  Offset of URL info for 2nd site.     |
       +----------------+-------------+---------------------------------------+
       |         12     |     4       |  ID of 2nd site where word is en-     |
       |                |             |  countered.                           |
       +----------------+-------------+---------------------------------------+
       |                                 ...                                  |
       +----------------+-------------+---------------------------------------+
       |    (N-1)*8     |     4       |  Offset of URL info for Nth site,     |
       |                |             |  where N is the total  number  of     |
       |                |             |  sites  in  which word is encoun-     |
       |                |             |  tered.                               |
       +----------------+-------------+---------------------------------------+
       |  (N-1)*8+4     |     4       |  Offset of URL info for Nth site.     |
       +----------------+-------------+---------------------------------------+
       |  (N-1)*8+8     |     4       |  Offset of URL info end  for  Nth     |
       |                |             |  site.  Must  point to the end of     |
       |                |             |  blob or file.                        |
       +----------------+-------------+---------------------------------------+
       | URLs information. Follows sites  information  immediately.           |
       | Offsets are counted from 0.                                          |
       +----------------+-------------+---------------------------------------+
       | Offset         | Length      |  Description                          |
       +----------------+-------------+---------------------------------------+
       +----------------+-------------+---------------------------------------+
       |          0     |     4       |  URL  ID  of  first site in sites     |
       |                |             |  info section.                        |
       +----------------+-------------+---------------------------------------+
       |          4     |     2       |  Word count in this URL.              |
       +----------------+-------------+---------------------------------------+
       |          6     |     2       |  First position.                      |
       +----------------+-------------+---------------------------------------+
       |          8     |     2       |  Second position.                     |
       +----------------+-------------+---------------------------------------+
       |                                 ...                                  |
       +----------------+-------------+---------------------------------------+
       |  6+(N-1)*2     |     2       |  Nth position, where N is the to-     |
       |                |             |  tal word count in the URL.           |
       +----------------+-------------+---------------------------------------+
       |     Repeated  with  info  for URLs from the same site, with ID       |
       |     greater than previous.                                           |
       +----------------------------------------------------------------------+
       |                                 ...                                  |
       +----------------------------------------------------------------------+
       |     Repeated with info for URLs for next sites from sites info       |
       |     section.                                                         |
       +----------------------------------------------------------------------+
   urlwordsNN.words
       This field contains gzipped content of URL.

       +------------+-------------------+-------------------------------------+
       |  Offset    |   Length          |   Description                       |
       +------------+-------------------+-------------------------------------+
       +------------+-------------------+-------------------------------------+
       |    0       |        4          |   Size of URL content before zip-   |
       |            |                   |   ping or 0xFFFFFFFF  if  content   |
       |            |                   |   is not zipped.                    |
       +------------+-------------------+-------------------------------------+
       |    4       |   Zipped size     |   Zipped or original URL content.   |
       +------------+-------------------+-------------------------------------+
   wordsite.sites
       This  field  contains array of sites/positions for word. Sorted by site
       IDs.

       Structure of array element:

       +------------------------+---------------------------------------------+
       |       Bits             |           Description                       |
       +------------------------+---------------------------------------------+
       +------------------------+---------------------------------------------+
       |       24-31            |           Bitmap of positions, high-        |
       |                        |           est  bit  is  set  to 1 is        |
       |                        |           word  is  first-level  do-        |
       |                        |           main.                             |
       +------------------------+---------------------------------------------+
       |        0-23            |           Site ID.                          |
       +------------------------+---------------------------------------------+
FILES
       /etc/aspseek/DBType/tables.sql

SEE ALSO
       aspseek(7), index(1), searchd(1).

AUTHORS
       Copyright (C) 2000, 2001, 2002 by SWsoft.
       Man  page  by Kir Kolyshkin <kir AT asplinux.ru> and Alexander F. Avdonkin
       <al AT asplinux.ru>.



ASPseek v.1.2.10                  2002/06/10                    aspseek-sql(5)
 

©2005 Comrite