W3Browse - Dialogs

Build Dictionary

This dialog creates different kinds of dictionaries. Once the settings are accepted, a DictIndex window opens and shows the progress of building the dictionary.

The following parameters for building a dictionary can be specified:

DictPref (text)

Specifies the prefix (directory plus basename) of the dictionary to be created.

DictType (list: dictdb, dummy, sql, text)

Determines the type of the parameter DictPref. The following types are supported:

dictdb: a standard dictionary is created,
dummy: no output is generated and the DictPref parameter is ignored,
sql: like "text", except that the extension .sql is used and each line forms a pgsql command,
text: a single file is generated, named by the DictPref parameter with the extension .txt appended; it contains LF-terminated lines, each consisting of a document reference followed by a space and a space-separated list of the words found in that document.

Note that the files generated by "sql" and "text" may be used again as input (see below).
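
As an illustration of the "text" layout described above, the following Python sketch formats one line per document. It is not part of W3Browse: the function names, the sample references and the UTF-8 encoding are assumptions made for this example only.

```python
def format_text_line(doc_ref, words):
    """Return one LF-terminated line: reference, a space, then space-separated words."""
    return doc_ref + " " + " ".join(words) + "\n"


def write_text_dictionary(path, entries):
    """Write (doc_ref, words) pairs to path, one line per document.

    The UTF-8 encoding is an assumption; the dialog text does not name one.
    """
    # newline="\n" keeps LF line ends on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for doc_ref, words in entries:
            out.write(format_text_line(doc_ref, words))
```

A call such as write_text_dictionary("demo.txt", [("a.txt", ["alpha", "beta"])]) would produce a file with the single line "a.txt alpha beta".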

In Memory (checkbox)

Indicates that dictionaries of type "dictdb" are built in memory and written out in a single step at the end. This is faster and therefore recommended, except for very large dictionaries.

Filter (list: chkwords, dummy, stopwords)

Selects a filter that is applied to words before they are passed on for further processing. The following filters are available:

dummy: no filtering,
chkwords: all words that look like a number are skipped,
stopwords: like "chkwords", but additionally all words that are members of the stopword list are removed.

The list of stopwords currently consists of 572 entries from the three languages English, French and German.
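
The three filters can be pictured with the following Python sketch. It is only an approximation under stated assumptions: the exact rule for "looks like a number" is not given here, so a simple pattern for digits with optional sign and separators is assumed; the stopword set is a tiny hypothetical excerpt, not the real 572-entry list; and case-insensitive stopword matching is likewise an assumption.

```python
import re

# Assumed heuristic for "looks like a number": optional sign, digits,
# optionally grouped by "." or ",". The real rule may differ.
NUMBER_RE = re.compile(r"[+-]?\d+(?:[.,]\d+)*")

# Hypothetical excerpt; the real list has 572 entries and is not reproduced here.
STOPWORDS = {"the", "and", "le", "et", "der", "und"}


def apply_filter(words, filter_name):
    """Return the words that pass the named filter."""
    if filter_name not in ("dummy", "chkwords", "stopwords"):
        raise ValueError("unknown filter: " + filter_name)
    if filter_name == "dummy":
        # No filtering at all.
        return list(words)
    # "chkwords" and "stopwords" both skip number-like words.
    kept = [w for w in words if not NUMBER_RE.fullmatch(w)]
    if filter_name == "stopwords":
        # Additionally drop stopwords (case-insensitive match is assumed).
        kept = [w for w in kept if w.lower() not in STOPWORDS]
    return kept
```

The layering mirrors the description: "stopwords" performs everything "chkwords" does and then removes list members as well.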

Directory (text)

Specifies the name of an existing file or directory that holds the source documents to be indexed.

DirType (list: cache, files, rfc-index, sql, text)

Determines the type of the parameter Directory. The following types are supported:

cache: the specified directory is a cache that contains the input documents and their URLs are taken as document references,
files: all files on the local filesystem underneath the specified directory are indexed recursively and their relative filenames are taken as document references,
rfc-index: the files of RFC documents (optionally compressed by gzip) are spread over subdirectories underneath the specified directory on the local filesystem and are named by the RFC number divided by 100; in addition, the RFC-index is read, and the titles found there, each prepended by the RFC number and a space, are taken as document references,
sql: like "text" except that each line makes up a pgsql command,
text: the parameter Directory specifies a single file that contains LF or CRLF terminated lines, each consisting of a document reference followed by a space and a space-separated list of associated words that are finally used for further processing.

Note that the files specified for "sql" and "text" may have been generated by the use of this dialog or by other tools.
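
For files produced by other tools, the "text" input layout can be checked with a small reader such as the Python sketch below. The function names and the UTF-8 encoding are assumptions, and the sketch further assumes that the document reference is the first space-delimited token of each line.

```python
def parse_text_line(line):
    """Split one input line into (doc_ref, words)."""
    # Accept both LF and CRLF line ends.
    line = line.rstrip("\r\n")
    # Assumption: the reference is the first space-delimited token.
    doc_ref, _, rest = line.partition(" ")
    return doc_ref, rest.split()


def read_text_input(path):
    """Yield (doc_ref, words) for every non-empty line of the file."""
    # newline="" leaves line ends untranslated; parse_text_line strips them.
    with open(path, "r", encoding="utf-8", newline="") as src:
        for line in src:
            if line.strip():
                yield parse_text_line(line)
```

A reference that itself contains spaces, such as the title-based references produced by "rfc-index", cannot be recovered by this simple split.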

Include Types (text)

Defines a space-separated list of content-type patterns; only documents whose content-type matches one of them are included in the dictionary. If not defined, all content-types are selected.

Exclude Types (text)

Defines a space-separated list of patterns of content-types to be excluded from the dictionary, superseding conflicting include types.
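
The interplay of the two type lists can be sketched in Python as follows. The pattern syntax is not specified in this text; shell-style wildcards (as in "text/*") are assumed here, parameters such as charset are stripped before matching, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase


def type_selected(content_type, include_types="", exclude_types=""):
    """Decide whether a content-type passes the include and exclude lists."""
    # Assumption: parameters after ";" are ignored and matching is case-insensitive.
    ctype = content_type.split(";", 1)[0].strip().lower()
    includes = include_types.lower().split()
    excludes = exclude_types.lower().split()
    # Exclude patterns win over conflicting include patterns.
    if any(fnmatchcase(ctype, pattern) for pattern in excludes):
        return False
    # An empty include list selects every content-type.
    if not includes:
        return True
    return any(fnmatchcase(ctype, pattern) for pattern in includes)
```

With include_types="text/*" and exclude_types="text/html", a text/plain document would be selected while a text/html document would not.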

Include URLs (text)

Specifies a space-separated list of URL prefixes; only documents whose URL starts with one of them are included in the dictionary. If not specified, all URLs are selected.

Exclude URLs (text)

Specifies a space-separated list of prefixes of URLs to be excluded from the dictionary. These prefixes take precedence over conflicting include prefixes.
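
The precedence rule for URL prefixes can be illustrated with the Python sketch below. The function name and the example URLs are hypothetical, and a plain case-sensitive prefix comparison is assumed.

```python
def url_selected(url, include_urls="", exclude_urls=""):
    """Decide whether a URL passes the include and exclude prefix lists."""
    includes = include_urls.split()
    excludes = exclude_urls.split()
    # Exclude prefixes take precedence over conflicting include prefixes.
    if any(url.startswith(prefix) for prefix in excludes):
        return False
    # An empty include list selects every URL.
    if not includes:
        return True
    return any(url.startswith(prefix) for prefix in includes)
```

For example, with the include prefix "http://example.org/" and the exclude prefix "http://example.org/private/", a document under the private path would be skipped although it also matches the include prefix.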

Notes

The parameters Include Types and Exclude Types select which content-types are indexed for documents that originate from a cache or from files on the local filesystem. If both fields are empty, all documents for which a handler or converter is available are indexed. Built-in handlers are available for the following content-types:

text/plain (plain text) and documents named README (case-insensitive),
text/html (HTML) and application/xhtml+xml (XHTML),
text/vnd.wap.wml (WML) and application/vnd.wap.wmlc (WMLc, binary form of WML),
*/xml and */*+xml (generic XML, just strips off all tags).

Other converters that make use of external programs can be defined with the help of the "MIME Applications" configuration applet. Note, however, that only converters that return plain text (text/plain, optionally with a charset parameter) are taken into account.

For most applications that make use of dictionaries, it is recommended to place the dictionary files in the same directory that holds the source documents to be indexed (see the parameter Directory).