This dialog can be used to create different kinds of dictionaries. After the settings have been accepted, a DictIndex window opens that shows the progress of building the dictionary.
The following parameters for building a dictionary can be specified:
Specifies the prefix (directory plus basename) of a dictionary that should be created.
(dictdb, dummy, sql, text)
Determines the type of the parameter DictPref. The following types are supported:
dictdb
dummy
sql: like "text", except that the extension .sql is used and that each line makes up a pgsql command,
text: a file with the extension .txt appended is generated that contains LF-terminated lines, each consisting of a document reference followed by a space and a space-separated list of words found in that document.
Note that the files generated by "sql" and "text" may be used again as input (see below).
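The line format of "text" dictionaries described above can be read back with a few lines of Python. The following is only a sketch: the function name parse_text_dictionary and the sample data are hypothetical, and it assumes that a document reference never contains a space.

```python
from collections import defaultdict


def parse_text_dictionary(data: str) -> dict[str, set[str]]:
    """Map each word to the set of document references whose line lists it."""
    index: dict[str, set[str]] = defaultdict(set)
    for line in data.split("\n"):
        if not line:
            continue  # skip the empty piece after the final LF
        # Assumption: the first space-separated field is the document reference.
        doc_ref, *words = line.split(" ")
        for word in words:
            if word:
                index[word].add(doc_ref)
    return dict(index)


# Hypothetical sample in the described format: "<reference> <word> <word> ...".
sample = "doc1 alpha beta\ndoc2 beta gamma\n"
index = parse_text_dictionary(sample)
```

The result is a simple inverted index, which is the usual shape for looking up the documents that contain a given word.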
Indicates that dictionaries of type "dictdb" should be created in memory and finally written out at once. This is faster and therefore recommended except for very large dictionaries.
(chkwords, dummy, stopwords)
Selects a filter that is applied to words before they are passed on for further processing. The following filters are available:
dummy
chkwords
stopwords: like "chkwords", but additionally all words that are members of the stopword list are removed.
The list of stopwords currently consists of 572 entries from the three languages English, French and German.
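The stopword removal step can be pictured as follows. This is a sketch under stated assumptions, not the actual filter: the STOPWORDS set is a tiny hypothetical stand-in for the real 572-entry list, the case-insensitive comparison is an assumption, and the checks that "chkwords" performs are not modelled here.

```python
# Hypothetical stand-in; the real list has 572 entries and is not reproduced here.
STOPWORDS = {"the", "and", "le", "et", "der", "und"}


def stopwords_filter(words):
    """Yield only the words that are not members of the stopword list."""
    for word in words:
        # Assumption: membership is tested without regard to case.
        if word.lower() not in STOPWORDS:
            yield word


filtered = list(stopwords_filter(["The", "proxy", "und", "cache"]))
```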
Specifies the name of an existing file or directory of the source documents that should be indexed.
(cache, files, rfc-index, sql, text)
Determines the type of the parameter Directory. The following types are supported:
cache
files
rfc-index
sql: like "text", except that each line makes up a pgsql command,
text
Note that the files specified for "sql" and "text" may have been generated by the use of this dialog or by other tools.
Defines a space-separated list of patterns of content-types to be exclusively included into the dictionary. If not defined, all content-types are selected.
Defines a space-separated list of patterns of content-types to be excluded from the dictionary, superseding conflicting include types.
Specifies a space-separated list of prefixes of URLs to be exclusively included into the dictionary. If not specified, all URLs are selected.
Specifies a space-separated list of prefixes of URLs to be excluded from the dictionary. These prefixes take precedence over conflicting include prefixes.
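The interplay of the two URL prefix lists can be sketched in Python. The function name url_selected and the example URLs are hypothetical; the logic only restates the rules given above (an empty include list selects everything, and an exclude prefix wins over a conflicting include prefix).

```python
def url_selected(url: str, include_prefixes: str = "", exclude_prefixes: str = "") -> bool:
    """Decide whether a URL would be selected under the rules described above."""
    includes = include_prefixes.split()  # space-separated lists, as in the dialog
    excludes = exclude_prefixes.split()
    if any(url.startswith(prefix) for prefix in excludes):
        return False  # exclude prefixes take precedence
    if not includes:
        return True  # nothing specified: all URLs are selected
    return any(url.startswith(prefix) for prefix in includes)


# Hypothetical example values.
inc = "http://example.org/docs/"
exc = "http://example.org/docs/old/"
```

With these values, http://example.org/docs/intro.html would be selected, while http://example.org/docs/old/intro.html would not.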
The parameters Include Types and Exclude Types make it possible to select which content-types of documents that originate from a cache or from files of the local filesystem should be indexed. If both fields are empty, all documents for which a handler or converter is available are indexed. Built-in handlers are available for the following content-types:
text/plain (plain text) and documents named README (case-insensitive),
text/html (HTML) and application/xhtml+xml (XHTML),
text/vnd.wap.wml (WML) and application/vnd.wap.wmlc (WMLc, binary form of WML),
*/xml and */*+xml (generic XML, just strips off all tags).
Other converters that make use of external programs can be defined with the help of the "MIME Applications" configuration applet, but note that only converters that return plain text (text/plain, optionally with a charset parameter) are taken into account.
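The selection by content-type patterns can be sketched in the same spirit. The exact pattern syntax is not specified here, so this example assumes shell-style wildcards (suggested by forms such as */xml) and uses Python's fnmatch module for them; the function name type_selected is hypothetical, and the availability of a handler or converter is not modelled.

```python
from fnmatch import fnmatchcase


def type_selected(content_type: str, include_types: str = "", exclude_types: str = "") -> bool:
    """Decide whether a content-type passes the include and exclude patterns."""
    # Drop parameters such as "; charset=utf-8" before matching (an assumption).
    base = content_type.split(";", 1)[0].strip().lower()
    includes = include_types.lower().split()
    excludes = exclude_types.lower().split()
    if any(fnmatchcase(base, pattern) for pattern in excludes):
        return False  # exclude types supersede conflicting include types
    if not includes:
        return True  # not defined: all content-types are selected
    return any(fnmatchcase(base, pattern) for pattern in includes)
```

For example, type_selected("text/html; charset=utf-8", "text/*") accepts the document, whereas adding "text/html" to the exclude list rejects it.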
For the benefit of most applications that make use of dictionaries, it is recommended to place the files of the dictionary in the same directory that is specified as the source of the documents to be indexed (see the parameter Directory).