Dictionaries

A dictionary, in the sense of w3browse, allows searching for the list of documents that contain a given word. A more sophisticated tool can be used to provide enhanced search capabilities such as boolean operators and word truncation. The implementation of dictionaries is rudimentary: no ranking of words or documents is carried out.

Indexing

Dictionaries can be created from different sources, e.g. from a cache or from files on the local filesystem (see also the dialog "Build Dictionary"). During the indexing process, every distinct word contained in a set of documents is extracted and put into a list. Each word in that list is associated with a set of references to the documents that contain it. The result is an inverted index.
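The idea of an inverted index can be illustrated with a minimal in-memory sketch. It is not the w3browse implementation, which stores its data in the binary files described under "Files". The function names and the shape of the input (a mapping from a document reference to a list of already extracted words) are assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each distinct word to the sorted list of documents containing it.

    `documents` is a hypothetical input format: a dict that maps a document
    reference (e.g. a URL or a relative filename) to a list of words.
    """
    index = defaultdict(set)
    for reference, words in documents.items():
        for word in words:
            # A set keeps each reference once, however often the word occurs.
            index[word].add(reference)
    # Sort words and references so that the result is deterministic.
    return {word: sorted(refs) for word, refs in sorted(index.items())}


def lookup(index, word):
    """Return the references of all documents that contain `word`."""
    return index.get(word, [])
```

Looking up a word then returns the list of documents that contain it, without any ranking, which matches the rudimentary search described above.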

A word is taken into consideration only if it consists of at least three characters and satisfies any optional requirements that are in effect, such as not being a member of a stopword list, not being interpretable as a number, or both. As implemented in w3browse, a word is a sequence of consecutive letters (A-Z) and digits (0-9), determined after the input has been transliterated to US-ASCII characters.
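The rules above can be sketched as follows. This is an illustration under stated assumptions, not the code used by w3browse: the transliteration is approximated by Unicode decomposition followed by dropping every non-ASCII character, words are folded to upper case, and a word counts as a number when it consists of digits only. The names extract_words, to_ascii, stopwords and skip_numbers are hypothetical.

```python
import re
import unicodedata

MIN_WORD_LENGTH = 3  # from the text: at least three characters
WORD_PATTERN = re.compile(r"[A-Za-z0-9]+")  # consecutive letters and digits


def to_ascii(text):
    """Approximate a transliteration to US-ASCII (assumption, lossy)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # Characters without an ASCII decomposition are silently dropped.
    return decomposed.encode("ascii", "ignore").decode("ascii")


def extract_words(text, stopwords=frozenset(), skip_numbers=False):
    """Return the words of `text` that qualify for indexing, in order."""
    words = []
    for token in WORD_PATTERN.findall(to_ascii(text)):
        word = token.upper()  # case folding is an assumption of this sketch
        if len(word) < MIN_WORD_LENGTH:
            continue
        if word in stopwords:
            continue
        if skip_numbers and word.isdigit():
            continue
        words.append(word)
    return words
```

With a stopword list containing "YEAR" and number filtering enabled, the input "Café au lait, 2024 is an RFC822 year" yields CAFE, LAIT and RFC822: the short words, the number and the stopword are discarded.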

Files

The files that constitute a dictionary are usually identified by a prefix, which is the name of a directory followed by a basename. A suffix is appended to this prefix to obtain the actual filename of each dictionary file. The suffixes are:

.url.key

Each line of this file is terminated by LF and contains a reference to an indexed document. The interpretation of a line is up to the application, e.g. a URL (cache), a relative filename (files), or an ID plus title (rfc-index).

.url.idx

Contains a balanced tree in binary form for faster access to document references.

.word.key

Each entry of this binary file contains a word together with a pointer to a list within the .url.ptr file.

.word.idx

Contains a balanced tree in binary form for faster access to word entries.

.url.ptr

This binary file contains a list of pointers to document references for each word.
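As a small illustration of the naming scheme, the following sketch derives the five filenames from a directory and a basename and reads the document references from the .url.key file. The function names are hypothetical. The lines are returned as raw bytes because their encoding is not stated here, and the binary files are left alone because their layout is not specified.

```python
import os

# The suffixes listed above, in the same order.
DICTIONARY_SUFFIXES = (".url.key", ".url.idx", ".word.key", ".word.idx", ".url.ptr")


def dictionary_files(directory, basename):
    """Map each suffix to the full filename derived from the prefix."""
    prefix = os.path.join(directory, basename)
    return {suffix: prefix + suffix for suffix in DICTIONARY_SUFFIXES}


def read_document_references(url_key_path):
    """Return the LF-terminated lines of a .url.key file as raw bytes."""
    with open(url_key_path, "rb") as handle:
        data = handle.read()
    lines = data.split(b"\n")
    # The final LF leaves one empty element at the end; drop it.
    if lines and lines[-1] == b"":
        lines.pop()
    return lines
```

How a returned line is interpreted (URL, relative filename, or ID plus title) remains up to the application.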

Restrictions

Currently it is not possible to update or extend a dictionary once it has been built. Instead, it has to be re-created entirely if indexed documents are modified or if documents are to be added to or deleted from the dictionary. Re-creation is only required, however, if the difference between the indexed and the current documents really matters. A dictionary by itself is independent of the indexed documents.