W3Browse - Concepts

Caches

A cache stores responses from servers for later review. As implemented in w3browse, a cache is not designed to be shared among clients in order to accelerate network access or the like. Instead, its primary purpose is to record complete responses to certain requests so that they can be replayed at other times, typically without network access (off-line).

A cache neither updates nor deletes any entry automatically. Only responses to GET requests with status codes "200" (OK), "301" (Moved Permanently) and "302" (Moved Temporarily) are stored, regardless of any further header information, while all other requests and responses are just passed through (see also section "Requests").
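The storage rule can be written as a small predicate. This is only a sketch of the rule as stated above; the names STORABLE_STATUS and is_storable are hypothetical and are not taken from w3browse.

```python
# Hypothetical helper illustrating the storage rule; not w3browse's own code.
STORABLE_STATUS = {200, 301, 302}


def is_storable(method: str, status: int) -> bool:
    """Return True if a response would be recorded by the cache.

    Only the request method and the status code are considered;
    all other header information is ignored, as described above.
    """
    return method == "GET" and status in STORABLE_STATUS
```

Everything for which is_storable returns False would simply be passed through without being recorded.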

Session Merging
Reorganization
Files
Notes on Merging
Remote Caches

Session Merging

Multiple caches from different sessions can be merged to form one large off-line cache, also called a merged cache (see also the dialog "Build Cache Index"). This way, every new session can be started with an empty cache and filled with current documents. All sessions are later merged into the large cache, with more recent entries overriding previous (presumably older) ones.

The process of merging is non-destructive and copies only references to entries into another index, while the original sessions are kept and can be reviewed separately at any time. In this context, a session may be regarded as a snapshot of the retrieved documents. Note that a merged cache should always operate in off-line mode!
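The override rule can be illustrated with plain dictionaries. This is a minimal sketch under assumptions: each session index is modelled as a mapping from URL to response file name, sessions are passed oldest first, and the function name merge_sessions is hypothetical. The on-disk format of the real index is not used here.

```python
def merge_sessions(sessions):
    """Merge per-session indexes into one mapping, oldest session first.

    `sessions` is a list of (session_name, index) pairs, where `index`
    maps a URL to a file name inside that session. Only references are
    copied; the session indexes themselves are left untouched.
    """
    merged = {}
    for session_name, index in sessions:
        for url, filename in index.items():
            # An entry from a later session replaces an earlier one.
            merged[url] = f"{session_name}/{filename}"
    return merged


older = {"http://example.org/": "a1", "http://example.org/news": "a2"}
newer = {"http://example.org/news": "b1"}
merged = merge_sessions([("1", older), ("2", newer)])
# merged["http://example.org/news"] is "2/b1"; both inputs are unchanged.
```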

Reorganization

Another maintenance operation is the reorganization of a cache (see also the dialog "Reorganize Cache"). Its main purpose is to build a cache from other caches, during which certain clean-up and filtering operations can be performed. The filesystem layout of a cache can also be specified or changed. Furthermore, it is possible to merge other caches into an existing cache. Conversely, the filtering option can be used to extract particular parts from other caches.

All operations are destructive to a source cache if the files are moved instead of being copied to the new cache. In any case, the relation to the original cache is lost, which means that the new cache, once it has been built, no longer depends on its sources.

Files

A cache is a collection of files and subdirectories that are located within a directory of the local filesystem. The name of a cache, as it is used within dialogs or other places, is the name of the directory in which it resides. Each response from a server including the headers is stored in a separate file that is named by a timestamp in hexadecimal format. The location of a new file is determined from a sequence number as described below. The main files of a cache are:

index.w3b: The main index file. It should never be deleted because it contains vital data. Each line is terminated by LF and consists of the URL of a request and, separated by a space character, the name of the response file relative to the root of the cache directory. The filename and the preceding space may be missing, which indicates that the entry was deleted. New entries are always appended to the end of the file in a single write operation. Existing entries are never modified, so the file may be regarded as a kind of log file.
index.idx: Contains a balanced tree in binary form, provided for faster access to the main index file by URL. The tree can be re-built from index.w3b and is updated with every new or deleted entry.
index.cnt: This file contains just one line with four numbers which are separated by spaces: "files per dir", "dirs per dir", a sequence number and the last timestamp used. The last two numbers are updated with every new response file. This file is created if it does not already exist at the time when the first new response file is needed.
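Because index.w3b is an append-only log, its current state can be obtained by replaying it line by line. The following sketch does that for the line format described above. It rests on two assumptions that are not stated in this document: a later line for the same URL supersedes an earlier one, and a URL never contains an unescaped space. The function name load_index is hypothetical.

```python
def load_index(text: str) -> dict:
    """Replay the contents of index.w3b into a URL -> filename mapping."""
    entries = {}
    for line in text.split("\n"):  # each line is terminated by LF
        if not line:
            continue
        # The URL ends at the first space; the rest is the response file name.
        url, _, filename = line.partition(" ")
        if filename:
            entries[url] = filename
        else:
            # A line without a filename marks the entry as deleted.
            entries.pop(url, None)
    return entries


sample = "http://a.example/ 1/3a5f\nhttp://b.example/ 3a60\nhttp://a.example/\n"
index = load_index(sample)
# index now holds only the entry for http://b.example/
```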

A cache may have an unlimited number of entries and therefore files. Because putting all files into one large directory is usually not very efficient, w3browse uses an extended method that works very well even on CD-ROM. In the following, f denotes the parameter "files per dir" (default = 100), d denotes "dirs per dir" (default = 26) and s denotes the sequence number that usually starts at zero and is incremented with every new file:

s < f

The first f files are put into the top-level directory.

s < (d + 1) * f

Next, a subdirectory is created and a further f files are put there, then a second subdirectory is created and a further f files are put there, and so on until d subdirectories have been created.

s < (d² + d + 1) * f

Now, up to d subdirectories are created within the first subdirectory and each of them receives f files, then up to d subdirectories are created within the second subdirectory and each of them receives f files, and so on until each of the d subdirectories has d subdirectories of its own.

...

The rest of the procedure is left as an exercise to the reader ;-)

Using the default parameters, up to 2700 files can be stored within one level of subdirectories and up to 70300 files within only two levels. The names of generated subdirectories are numbers that are derived from the sequence number at the corresponding level and are represented in a radix 36 notation using the characters 0-9 and a-z, in this order. The resulting names usually consist of just one digit.
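The figures above follow from the thresholds f, (d + 1) * f and (d² + d + 1) * f. The sketch below computes the capacity for a given nesting depth, the depth at which a sequence number falls, and the radix-36 form of a number. How the directory number at each level is derived from the sequence number is not specified here, so the sketch does not attempt to build a complete path. All function names are hypothetical.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # 0-9, then a-z


def capacity(levels: int, f: int = 100, d: int = 26) -> int:
    """Number of files that fit with at most `levels` levels of subdirectories."""
    return f * sum(d ** k for k in range(levels + 1))


def level_of(s: int, f: int = 100, d: int = 26) -> int:
    """Nesting depth at which the file with sequence number s is stored."""
    level = 0
    while s >= capacity(level, f, d):
        level += 1
    return level


def to_radix36(n: int) -> str:
    """Represent a non-negative integer in radix 36 using DIGITS."""
    if n == 0:
        return "0"
    out = ""
    while n > 0:
        n, remainder = divmod(n, 36)
        out = DIGITS[remainder] + out
    return out


print(capacity(1), capacity(2))  # prints "2700 70300"
```

With the default d = 26, every directory number below 26 maps to a single radix-36 digit, which is consistent with the remark that the names usually consist of just one digit.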

Notes on Merging

When a cache is opened, a cache merger process is started automatically if the main index file index.w3b is not present or is empty. During this procedure, all immediate subdirectories are scanned for an index file and, if one is present, its contents are included in the merged cache index.

For this reason, it is recommended to organize different cache sessions as direct subdirectories of a directory that is then regarded as a merged cache. Because the cache merger sorts the names of subdirectories like numbers (by name, but with shorter names first), it is best to use increasing numbers for the session names, possibly with a fixed prefix.
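The ordering "by name, but shorter names first" can be reproduced with a two-part sort key. This is a sketch of the ordering as described, not the merger's actual code; the key function and the session names are made up for illustration.

```python
def session_sort_key(name: str):
    """Order names by length first, then alphabetically."""
    return (len(name), name)


names = ["s10", "s2", "s100", "s1"]
ordered = sorted(names, key=session_sort_key)
print(ordered)  # prints ['s1', 's2', 's10', 's100']
```

A plain alphabetical sort would place "s10" and "s100" before "s2", which is why increasing numbers with a fixed prefix work well with this ordering.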

Merging merged caches is possible, but probably not very useful.

Remote Caches

Another kind of access to a cache may be called a remote cache. Having received the main index file index.w3b of a cache and knowing its URL is sufficient to get responses from that cache. This may be used to access a cache that is stored on a website or on an FTP server, or a cache may be sent by e-mail as a multipart MIME message. In general, however, such a cache cannot be updated, so it should be regarded as an off-line cache. The missing file index.idx can easily be reconstructed from index.w3b and may be held in memory only, if it is needed at all.

The URL for retrieving a response file is determined by resolving its relative filename (as given in the index file) against the URL of the index file itself. It is not an accident that the filename can be used as a relative URL at the same time, without any (un)escaping.
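Resolving a relative filename against the URL of the index file is ordinary relative URL resolution. The sketch below uses urljoin from Python's standard library; the host name, the path and the file name are invented for illustration, and response_url is a hypothetical helper name.

```python
from urllib.parse import urljoin


def response_url(index_url: str, relative_name: str) -> str:
    """Resolve a file name from index.w3b against the URL of index.w3b."""
    return urljoin(index_url, relative_name)


url = response_url("http://www.example.org/cache/index.w3b", "3/3a5f0c21")
print(url)  # prints "http://www.example.org/cache/3/3a5f0c21"
```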

The content-type of a retrieved entity will almost always be wrong, so it should be ignored and the entity should simply be assumed to be a valid response file from the cache. A more elaborate approach, namely probing whether the entity is in fact an HTTP response message and, failing that, using the original response, even makes it possible to treat multipart MIME messages of type multipart/related as a remote cache. This is part of w3browse (see also section "e-Mail Application").
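The probing step might look like the sketch below. It assumes that a stored response file begins with an HTTP status line such as "HTTP/1.0 200 OK"; this document only says that a response is stored including its headers, so the exact test is an assumption, and both function names are hypothetical.

```python
import re

# Assumed shape of the first line of a stored response, e.g. "HTTP/1.0 200 OK".
STATUS_LINE = re.compile(rb"HTTP/\d+\.\d+ \d{3}( |\r?\n)")


def looks_like_http_response(entity: bytes) -> bool:
    """Return True if the entity starts with something like a status line."""
    return STATUS_LINE.match(entity) is not None


def select_message(entity: bytes, original_response: bytes) -> bytes:
    """Use the entity as the cached response if it passes the probe,
    otherwise fall back to the original response."""
    if looks_like_http_response(entity):
        return entity
    return original_response
```

With this fallback, an entity that is an ordinary document rather than a stored response is still delivered through the original response.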