


index(1)		  User Commands			 index(1)



NAME
     index - SWISH++ indexer

SYNOPSIS
     index [ options ] directory... file...

DESCRIPTION
     index is the SWISH++ file indexer.  It indexes the specified
     files and the files in the specified directories, including
     files in their subdirectories.  Files are indexed only if
     their filename extension is among the set specified with the
     -e option.
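
     The selection rule above can be sketched in Python.  This is
     an illustration only, not the indexer's actual code: the
     function names are hypothetical, and the treatment of case
     and of symbolic links is an assumption based on the -e and
     -l options.

```python
import os

def has_wanted_extension(filename, extensions):
    """True if the text after the last dot is in extensions (given without the dot)."""
    _, ext = os.path.splitext(filename)
    # splitext returns "" for names with no extension, so those never match.
    return bool(ext) and ext[1:] in extensions

def files_to_index(paths, extensions, follow_links=False):
    """Yield matching files from paths, descending into subdirectories."""
    for path in paths:
        if os.path.isdir(path):
            for dirpath, _dirnames, filenames in os.walk(path, followlinks=follow_links):
                for name in sorted(filenames):
                    full = os.path.join(dirpath, name)
                    # Assumption: symbolic links are skipped unless asked for.
                    if not follow_links and os.path.islink(full):
                        continue
                    if has_wanted_extension(name, extensions):
                        yield full
        elif has_wanted_extension(path, extensions):
            if follow_links or not os.path.islink(path):
                yield path
```

     For example, files_to_index(["/home/www/htdocs"], {"html",
     "txt"}) would yield only the .html and .txt files beneath
     that directory.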

  Character Entities
     Both HTML character and numeric  (decimal	and  hexadecimal)
     entity  references	 are  converted	 to their ASCII	character
     equivalents before	further	examination  and  indexing.   For
     example,  ``r&eacute;sum&#233;''  becomes	``resume'' before
     indexing.
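
     As a rough illustration of this conversion, the Python
     sketch below decodes entity references and then drops
     accents.  It is not the indexer's implementation; the
     function name fold_entities is hypothetical, and the use of
     Unicode decomposition to find the ASCII equivalent is an
     assumption.

```python
import html
import unicodedata

def fold_entities(text):
    """Decode HTML entity references, then strip accents from the result."""
    # html.unescape handles named, decimal, and hexadecimal references.
    decoded = html.unescape(text)
    # Decompose accented letters into a base letter plus combining marks,
    # then drop the marks.  Characters with no decomposition pass through.
    decomposed = unicodedata.normalize("NFKD", decoded)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_entities("r&eacute;sum&#233;"))  # resume
```

     Characters that have no decomposition into an ASCII base
     letter are left unchanged by this sketch.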

  Word Determination
     Stop words, words that  occur  too	 frequently  or	 have  no
     information   content,   are   not	 indexed.   (There  is	a
     compiled-in set of	a few hundred such words.)  Additionally,
     several  heuristics are used to determine which words should
     not be indexed.

     First, a word is checked to see if	it looks like an acronym.
     A	word  is  considered  an acronym only if it starts with	a
     capital  letter  and  is  composed	 exclusively  of  capital
     letters,  digits,	and  punctuation symbols, e.g.,	``AT&T.''
     If	a word looks like  an  acronym,	 it  is	 indexed  and  no
     further checks are	done.
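
     The acronym test can be sketched in Python as follows.  The
     function name is hypothetical, and treating every character
     in string.punctuation as a punctuation symbol is an
     assumption.

```python
import string

def looks_like_acronym(word):
    """True if word starts with a capital and has only capitals, digits, punctuation."""
    if not word or not word[0].isupper():
        return False
    return all(
        ch.isupper() or ch.isdigit() or ch in string.punctuation
        for ch in word
    )

print(looks_like_acronym("AT&T"))  # True
print(looks_like_acronym("Nasa"))  # False
```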

     Second, there are several other checks that are applied.	A
     word is not indexed if it:

     1.  Starts with a capital letter, is of mixed case, and
         contains more than a third capital letters, e.g.,
         ``BizZARE.''

     2.  Contains a capital letter other than the first, e.g.,
         ``weIrd.''

     3.  Has fewer than Word_Min_Size letters.  (Default is 4.)

     4.	 Contains no vowels.

     5.  Contains more than Word_Max_Consec_Same of the same
         character consecutively (not including digits).
         (Default is 2.)




SWISH++		 Last change: February 27, 1998			1







     6.	 Contains more	than  Word_Max_Consec_Vowels  consecutive
	 vowels.  (Default is 4.)

     7.	 Contains more than  Word_Max_Consec_Consonants	 consecu-
	 tive consonants.  (Default is 5.)
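
     The checks above can be sketched in Python as one function
     that returns the number of the first rule a word fails.  This
     is an illustration under stated assumptions, not the
     indexer's code: the function names are hypothetical, the
     vowels are taken to be a, e, i, o, and u, and ``a third'' is
     measured against the number of letters in the word.  Stop
     words are not handled here.

```python
import string

WORD_MIN_SIZE = 4
WORD_MAX_CONSEC_SAME = 2
WORD_MAX_CONSEC_VOWELS = 4
WORD_MAX_CONSEC_CONSONANTS = 5
VOWELS = set("aeiouAEIOU")  # assumption: "y" is not counted as a vowel

def looks_like_acronym(word):
    if not word or not word[0].isupper():
        return False
    return all(c.isupper() or c.isdigit() or c in string.punctuation for c in word)

def longest_run(word, matches):
    """Length of the longest run of consecutive characters satisfying matches."""
    best = current = 0
    for ch in word:
        current = current + 1 if matches(ch) else 0
        best = max(best, current)
    return best

def longest_same_char_run(word):
    """Longest run of one repeated character; digits are ignored and break runs."""
    best = current = 0
    previous = None
    for ch in word:
        if ch.isdigit():
            previous, current = None, 0
            continue
        current = current + 1 if ch == previous else 1
        previous = ch
        best = max(best, current)
    return best

def rejection_reason(word):
    """Return the number (1-7) of the first failed check, or None if indexable."""
    if looks_like_acronym(word):
        return None  # acronyms are indexed with no further checks
    letters = [ch for ch in word if ch.isalpha()]
    capitals = sum(ch.isupper() for ch in letters)
    mixed_case = 0 < capitals < len(letters)
    if word[:1].isupper() and mixed_case and capitals * 3 > len(letters):
        return 1
    if any(ch.isupper() for ch in word[1:]):
        return 2
    if len(word) < WORD_MIN_SIZE:
        return 3
    if not any(ch in VOWELS for ch in word):
        return 4
    if longest_same_char_run(word) > WORD_MAX_CONSEC_SAME:
        return 5
    if longest_run(word, lambda c: c in VOWELS) > WORD_MAX_CONSEC_VOWELS:
        return 6
    if longest_run(word, lambda c: c.isalpha() and c not in VOWELS) > WORD_MAX_CONSEC_CONSONANTS:
        return 7
    return None
```

     With the default limits, rejection_reason("weIrd") returns 2
     and rejection_reason("indexing") returns None.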

OPTIONS
     -eextension   A filename extension, given without the
                   leading ``dot,'' of files to index.  Multiple
                   -e options may be specified.

     -ffile_max	   The maximum number of files a word  may  occur
		   in  before  it  is discarded	as being too fre-
		   quent.  The default is infinity.

     -iindex-file  The name of the  generated  index  file.   The
		   default  is	the.index  in the present working
		   directory.

     -l		   Follow symbolic links  during  indexing.   The
		   default is not to follow them.

     -ppercent_max The maximum percentage of files a word may
                   occur in before it is discarded as being too
                   frequent.  The default is 100.  If you want to
                   keep all words regardless, specify a number
                   greater than 100.

     -vverbosity   The verbosity level,	0-3:

		   0   No  output  is	generated   (except   for
		       errors).
		   1   Only run	statistics (elapsed time,  number
		       of files, word count) are printed.
		   2   Directories  are	  printed   as	 indexing
		       progresses.
		   3   Directories and files are printed  with	a
		       word-count for each file.

     -V		   Print the version number of SWISH++ and exit.
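
     The -f and -p thresholds can be pictured with the Python
     sketch below.  It is a guess at the comparison, not the
     indexer's code: the function name is hypothetical, and the
     boundary handling is an assumption.  A word is taken to be
     discarded once it exceeds file_max files, or once its
     percentage reaches percent_max, which is consistent with the
     advice to specify a number greater than 100 to keep every
     word.

```python
def is_too_frequent(files_with_word, total_files, file_max=None, percent_max=100):
    """Decide whether a word occurs in too many files to be worth indexing."""
    if total_files <= 0:
        return False
    # -f: an absolute cap on the number of files; None stands for infinity.
    if file_max is not None and files_with_word > file_max:
        return True
    # -p: a cap on the percentage of files, compared without division.
    # Assumption: reaching the percentage is enough to discard the word.
    return files_with_word * 100 >= percent_max * total_files

print(is_too_frequent(10, 10))                   # True
print(is_too_frequent(10, 10, percent_max=101))  # False
```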

EXAMPLE
     To	index all HTML and text	files on a web server:

	  index	-v3 -e html -e shtml -e	txt /home/www/htdocs


EXIT STATUS
     Exits with a value of zero only if indexing completed
     successfully; non-zero otherwise.








CAVEATS
     Files without extensions cannot be indexed.  Generated index
     files are machine-dependent: they depend on the sizes of
     data types and the byte order of the machine that generated
     them.

FILES
     the.index	   default index file name

SEE ALSO
     extract(1), search(1)

     International Organization for Standardization.  ``ISO
     8859-1: Information Processing -- 8-bit single-byte coded
     graphic character sets -- Part 1: Latin alphabet No. 1.''
     1987.

     International Organization for Standardization.  ``ISO 8879:
     Information Processing -- Text and Office Systems --
     Standard Generalized Markup Language (SGML).''  1986.

     World Wide	Web Consortium.	 ``Character entity references in
     HTML 4.0.''  HTML 4.0 Specification, http://www.w3.org/

AUTHOR
     Paul J. Lucas <pjl@best.com>






























