HTML Documents

HTML is probably the most commonly used format for pages on the web. This markup language was originally designed for structural markup of content, but over time it has also been extended (and abused) by the introduction of several presentational elements. Nowadays, the separation of content and layout is becoming more and more important as the number of different applications that make use of HTML documents is constantly growing. This is supported by the increasing use of XHTML, a reformulation of HTML in XML, together with style sheets.

Structure of an HTML Document

The body of a simple enough and well-structured HTML document is divided into a number of sections. Each section consists of a heading that is followed by a mixture of subsections, paragraphs and different kinds of lists, where the list items themselves are composed of paragraphs as well as sublists. These kinds of components of a document are called block-level elements.

The actual content of an HTML document, as it appears in paragraphs, headings, definition terms and other such places, consists primarily of plain text. In addition, so-called inline (or character-level) elements are used to add structural information, e.g. emphasis or hyperlinks, to fragments of text.

An HTML document also includes certain information about itself and/or its relation to other documents. This meta data consists of at least the title of the document, but a brief description and some keywords should additionally be given for the sake of search engines. References to style sheets and other related documents can also be specified.

The web page editor is able to read all known formats of HTML documents (see also [HTML]) and extract the relevant information, stripping off any ballast at the same time. A generated HTML document always uses the XHTML 1.0 format, but without an explicit document type declaration.

Textual Representation of HTML

The textual representation of a simple HTML document looks like this:

#Title A Simple HTML Document
#Lang en-us
#Meta:description This is an example of a simple HTML document.
#Meta:keywords example, HTML document, inline markup
#Meta:date 2004-06-30
#Link:stylesheet styles/default.css
#Link:alternate:stylesheet styles/alternate.css Alternate Style

= A Simple HTML Document =

This HTML document is just an example and is used to
demonstrate some features such as:

  1. Paragraphs, headings and [#inline||inline
     markup].
  2. A horizontal rule at the end of the document.
  3) Some kinds of lists.

== [|inline|Inline Markup] ==

Some examples of inline markup:

  * _emphasis_ and *strong emphasis*
  * 'sample text'
  * [internal://admin/editpage||Link to EditPage]

----------
Meta Data

All lines at the beginning of the textual representation of an HTML document up to the first empty line are special and are used to define the meta data of the document. Each line provides a certain piece of information and is split into a sequence of white-space-separated words. The case-insensitive first word or a prefix of it determines the meaning of the rest of the line. The following cases are distinguished:

#Title (whole word)
The following words make up the contents of the <title> element of the document.
#Lang (whole word)
The following word specifies the language of the document and consists of an international language code that is optionally extended by a country code, e.g. en, en-us, de, de-de. It is used to define the lang and xml:lang attributes of the <html> root element.
#Meta: (prefix)
A <meta> element is created whose name attribute is set to the rest of the first word and whose content attribute is created from the following words.
#Link: (prefix)
This prefix introduces a <link> element whose rel attribute is a space-separated list of values that is derived from the rest of the first word which in turn is a colon-separated (:) list of values. The href attribute is set to the second word and the optional title attribute is created from the third and following words if present.

All lines that contain unknown or invalid meta data definitions are just ignored.
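
The meta data rules above can be sketched in a few lines of Python. This is only an illustration, not the editor's actual code: the function name parse_meta and the shape of the returned dictionary are assumptions, as is the decision to treat a definition without any value as invalid.

```python
def parse_meta(lines):
    """Parse the leading meta data lines up to the first empty line.

    Hypothetical helper: the returned dictionary layout is an assumption.
    """
    meta = {"title": None, "lang": None, "meta": {}, "links": []}
    for line in lines:
        if not line.strip():
            break  # the first empty line ends the meta data block
        words = line.split()
        key, rest = words[0].lower(), words[1:]
        if key == "#title" and rest:
            meta["title"] = " ".join(rest)
        elif key == "#lang" and rest:
            meta["lang"] = rest[0]
        elif key.startswith("#meta:") and len(key) > 6 and rest:
            # name = rest of the first word, content = following words
            meta["meta"][words[0][6:]] = " ".join(rest)
        elif key.startswith("#link:") and len(key) > 6 and rest:
            # rel = colon-separated values joined by spaces
            link = {"rel": " ".join(words[0][6:].split(":")), "href": rest[0]}
            if len(rest) > 1:
                link["title"] = " ".join(rest[1:])
            meta["links"].append(link)
        # unknown or invalid definitions are silently ignored
    return meta
```

Applied to the example document shown earlier, the line "#Link:alternate:stylesheet styles/alternate.css Alternate Style" would yield a link with rel "alternate stylesheet", href "styles/alternate.css" and title "Alternate Style".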

Body Part

The body of an HTML document is built from all remaining lines that follow the first empty line of the textual representation of an HTML document. Each line is first preprocessed and is then evaluated in the current context of a certain block-level element, starting with a hypothetical non-indented body element.

An empty line or an indented line usually starts a new block-level element, while an outdented line (or the end of the input) always terminates the current element and causes the enclosing element to become current again. That line is then processed once more with respect to the changed context, which may cause the line to become outdented again, and so on until the body element is reached at the latest. The collected (actual) text of a terminated element is finally examined in order to identify inline elements.

During preprocessing of a line, trailing white-space characters are removed and consecutive empty lines are collapsed into one. All leading tabulator characters of a line are finally replaced by eight space characters each.
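
A minimal Python sketch of this preprocessing step might look as follows. The function name preprocess is hypothetical, and the sketch assumes that "leading" refers only to tabulator characters at the very start of a line.

```python
def preprocess(lines):
    """Strip trailing white-space, expand leading tabs, collapse empty lines."""
    result = []
    for raw in lines:
        line = raw.rstrip()  # remove trailing white-space (and line ends)
        body = line.lstrip("\t")
        # each leading tabulator character becomes eight spaces
        line = " " * (8 * (len(line) - len(body))) + body
        if line == "" and result and result[-1] == "":
            continue  # collapse consecutive empty lines into one
        result.append(line)
    return result
```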

The indentation of a line is determined by the total number of space characters at the beginning of a line. The precise and absolute value of the indentation usually does not matter, only its relative value in relation to the indentation of an enclosing element such as a list item is relevant (less than = outdented, equal to = non-indented, greater than = indented). If not otherwise stated in the following sections, a line is generally meant to be already adjusted according to the current context, thus normally having a relative indentation of zero.
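
The three relative cases can be expressed as a small Python helper. The name relative_indent and the returned labels are illustrative assumptions; the line is expected to be preprocessed already, and empty lines are assumed to be handled separately.

```python
def relative_indent(line, context_indent):
    """Classify a non-empty, preprocessed line relative to the enclosing element."""
    indent = len(line) - len(line.lstrip(" "))  # number of leading spaces
    if indent < context_indent:
        return "outdented"
    if indent == context_indent:
        return "non-indented"
    return "indented"
```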

Empty lines are usually only required in order to separate adjacent text blocks into paragraphs. Text blocks that make up the sole content of (definition) list items are already distinguished by this fact, but it is possible to insert additional empty lines in order to add extra spacing between such items. As a rule of thumb, a text block is regarded as a paragraph and marked up accordingly if it is additionally surrounded by empty lines before and after. This is always assumed to be the case at the top-level and hence in the context of the body element.

Block-level Elements

A paragraph is just a sequence of non-empty lines that cannot be interpreted in any other way, e.g. as terms of a definition list item. Adjacent paragraphs are separated by empty lines. The given division of a paragraph into lines is preserved between editions of the document.

A block of preformatted text starts with an indented line that cannot be interpreted in any other way, e.g. as the start of a list item. All following lines that have an equal or greater indentation than the first line are part of the same block. Intervening empty lines are also included. Every line of the block is finally examined separately for inline elements.

A heading consists of just one line that encloses the actual heading title by one space character on each side first and then by the same sequence of one or more equal signs (=) on both sides. The number of equal signs indicates the importance level of a heading, with level 1 being the most important.

= Level 1 Heading =
== Level 2 Heading ==
...
====== Level 6 Heading ======
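
As a rough illustration, the heading rule can be checked with a regular expression in Python. The name parse_heading is hypothetical, and the upper limit of six levels is an assumption taken from the example above.

```python
import re

# same run of equal signs on both sides, one space next to the title
HEADING = re.compile(r"^(={1,6}) (.+) \1$")

def parse_heading(line):
    """Return (level, title) for a heading line, otherwise None."""
    m = HEADING.match(line)
    if m is None:
        return None
    return len(m.group(1)), m.group(2)
```

A line with a different number of equal signs on each side, or one without the separating spaces, is not treated as a heading by this sketch.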

A horizontal rule consists of just one line that is a sequence of four or more identical characters that are taken from the set "-=*+~_#^".

********
======
----
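
The horizontal rule test is simple enough to sketch as a single regular expression in Python (the name is_horizontal_rule is an assumption):

```python
import re

# one character from the set -=*+~_#^ repeated at least four times in total
RULE = re.compile(r"^([-=*+~_#^])\1{3,}$")

def is_horizontal_rule(line):
    """True if the line consists of four or more identical rule characters."""
    return RULE.match(line) is not None
```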

A blockquote section is very similar to an unordered list item. The only two differences are that the bullet consists of a ">" character and that consecutive blockquote sections are not merged in any way.

A blockquote section follows:

  > This is a rather short
    block quotation.

----------

An item of an ordered or unordered list is introduced by an indented line that starts with a so-called bullet, which must be followed by at least one space character. Replacing the bullet by the same number of spaces yields a line whose indentation determines the indentation of the newly created block-level element that holds the content of the list item. An unindented version of the line is finally processed again in the new context.

The rest of the list item is completed by the usual processing which allows any block-level element to be used in the new context and which also causes the list item to be terminated by an outdented line. That line usually either introduces the next list item or terminates the list. Consecutive list items of the same type are finally merged into a list of that type. The type of a list item is derived from its bullet.

The bullet of an unordered list item consists of just one of the characters "*", "-" or "+".

An unordered list follows:

  * first item
  *     second item
    * third
      item

----------
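
The way the content indentation of an unordered list item is derived can be sketched in Python as follows. The function name split_list_item is hypothetical, and the indentation is taken relative to a context indentation of zero.

```python
import re

# indented line: spaces, one bullet character, at least one space, then text
BULLET = re.compile(r"^( +)([*+-]) +(?=\S)")

def split_list_item(line):
    """Return (bullet, content_indent, text) for a list item line, else None."""
    m = BULLET.match(line)
    if m is None:
        return None
    # replacing the bullet by a space leaves this many leading spaces
    content_indent = m.end()
    return m.group(2), content_indent, line[content_indent:]
```

For the second item of the example above, the extra spaces after the bullet move the content indentation to column 8, while the first item has a content indentation of 4.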

There are three types of ordered list items available:

The bullet of an ordered list item consists of a non-empty sequence of only those characters that belong to the same type, and to which either a "." or ")" character is finally appended. The actual value of the character sequence currently does not matter.

An ordered list follows:

  1. one
  2)    two
     3. three

Another ordered list follows:

	a) apple
     b.    banana
  c) cherry

----------

An item of a definition list consists of one or more definition terms and the actual definition. A definition term is a sequence of non-empty and non-indented lines with the last line being suffixed by two colons (::). The actual definition starts with an indented line that cannot be interpreted as an item of an ordered or unordered list and is processed the same way as the content of such a list item. Consecutive definition list items are finally merged into a definition list.

A definition list follows:

First term::
    Definition of the first term.
Second
term::
Third term::
  Definition of the second and
  the third term.

----------

Multiple definition terms for the same definition cannot be separated by empty lines. When a definition term is spread over multiple lines, all of them but the last cannot end in any of the characters ".", "!", "?" or ":", which are common terminators of sentences. This rule makes it possible to start a definition term immediately following a text block without an intervening empty line, because otherwise the preceding text would become part of the definition term, which is usually not intended in such cases.

The content of a definition term may be empty, in which case the definition term consists only of the terminating two colons (::) and causes the definition term to be effectively omitted. So, if all definition terms for a definition are cancelled that way, the definition list item represents just an indented block and is comparable in its effect to a blockquote section.

::
    ::
	This is a double
	indented paragraph.

----------

A so-called compact definition list item consists of a definition term that is followed by the first line of the actual definition on the same line. Both parts must be separated by at least three spaces. The definition term cannot be terminated by the usual two colons (::), but they must be used nevertheless for an empty term. Such a definition term may be regarded as a kind of user-defined bullet of a list item that is a definition in this case. In fact, further processing really follows this model and also causes the procedure to be applied recursively if needed. As a result, it is even possible to emulate simple tables by using compact definition lists. Note also the two empty cells 2.2 and 3.3 in the following example.

An indented emulated simple table follows:

::
    1.1   1.2   1.3   1.4
    2.1   ::    2.3   2.4
    3.1   3.2   ::    3.4

-------------------------
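
The recursive application of the three spaces pattern to one row of such an emulated table can be approximated in Python as shown below. This is a flattened sketch under stated assumptions: the name table_row is hypothetical, and the nested definition list items that the real processing would create are reduced to a plain list of cells.

```python
import re

def table_row(line):
    """Split a line into cells at runs of three or more spaces."""
    cells = []
    rest = line.strip()
    while True:
        # the term is everything in front of the first run of 3+ spaces
        parts = re.split(r" {3,}", rest, maxsplit=1)
        cells.append("" if parts[0] == "::" else parts[0])  # "::" = empty term
        if len(parts) == 1:
            break  # no further separator: this was the final definition
        rest = parts[1]
    return cells
```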

Another way to introduce an item of a compact definition list is to place the definition term on one line that is immediately followed by the actual definition without an intervening empty line. Such a definition term cannot be terminated by the usual two colons (::).

A compact definition list follows:

First term
    Definition of the first term.
Second term
  Definition of the second term.
Third term
      Definition of the third term.

----------

Finally, both variants of compact definition terms may be preceded by additional definition terms, each of them suffixed by two colons (::) in the usual way.

Normal and compact definition list items can be mixed arbitrarily and consecutive items are merged as usual into a definition list, but only the first item of a definition list determines whether the list as a whole gets the compact attribute or not.

It should be noted that the recognition of compact definition list items (the three spaces pattern) occurs for all non-indented lines in the current context. Headings and preformatted text are not affected, but in other cases, three or more spaces in a row within the actual text of a line should be avoided so that this rule is not applied unintentionally.

Inline Elements

The following character sequences are treated specially when they occur in the actual text of an element and are given here together with their meaning:

_Emphasized text_
Marks the enclosed text for normal emphasis (<em>, on input also <i>, <var>, <cite>, <u>).
*Strongly emphasized text*
Marks the enclosed text for strong emphasis (<strong>, on input also <b>, <dfn>, <th>).
'Sample text'
Marks the enclosed text as sample text (<code>, on input also <tt>, <samp>, <kbd>). This sequence is not recognized within preformatted text because it is not needed there.
[URL|Fragment|Link text]
Marks the enclosed link text as a hyperlink (URL not empty), an anchor (Fragment not empty), or both (<a name="Fragment" href="URL">). This sequence is not recognized within link texts. URL and Fragment cannot both be empty at the same time, and neither is allowed to contain white-space characters or any character from the set {}[|]<>". Note that a hyperlink that refers to an anchor within the same document is specified (in the usual way) with a URL value such as #Fragment. For further information see also section "URL Syntax".
<localpart@domain>
Marks the enclosed text as an e-mail address (<a href="mailto:localpart@domain">). This sequence is not recognized within link texts and preformatted text.

The given definitions can more or less obviously lead to misinterpretations in certain situations. In order to resolve many ambiguities, some additional rules are applied during the processing of text fragments in deciding whether a given character sequence represents valid markup or not. These rules can be described as follows, where the start or end of a text fragment also counts as white-space:

  1. The enclosed text cannot be empty, but an empty link text is allowed and is replaced by the non-empty value of either URL or Fragment in such a case, e.g. in [http://www.aksware.de/||].
  2. The enclosed text cannot start or end with a white-space character.
  3. A character on either side of a special sequence can only be white-space or any character from the set '`!"$%&@\/(){}[|]<>=?*+-~#,.;:_^. This rule is used to ensure that embedded special characters are left alone in certain contexts, e.g. in contractions (I'm right, aren't I?), in calculations (1*2+3*4) or in variable names (HTTP_USER_AGENT).
  4. A link text may contain nested pairs of brackets ([]), which means that every opening "[" must be closed by a "]" accordingly, e.g. in [refs.html#html||[HTML]].
  5. Both parts of an e-mail address are only allowed to contain letters, digits, hyphens (-) and dots (.). The localpart may also contain underscores (_), but the domain must contain at least one dot, e.g. <no_reply@aksware.de>.
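
The link and e-mail rules can be sketched in Python as follows. Both function names are hypothetical. The sketch assumes that "letters" means ASCII letters, that an empty link text is replaced by the URL in preference to the Fragment, and it does not check rules 2 and 3.

```python
import re

# "[URL|Fragment|" where neither value contains white-space or {}[|]<>"
LINK_PREFIX = re.compile(r'\[([^\s{}\[|\]<>"]*)\|([^\s{}\[|\]<>"]*)\|')
EMAIL = re.compile(r"^<([A-Za-z0-9._-]+)@([A-Za-z0-9.-]+)>$")

def parse_link(text):
    """Parse a link at the start of text; return (url, fragment, label, end)."""
    m = LINK_PREFIX.match(text)
    if m is None or not (m.group(1) or m.group(2)):
        return None  # URL and Fragment must not both be empty
    depth = 1
    for j in range(m.end(), len(text)):
        if text[j] == "[":
            depth += 1  # rule 4: nested brackets must be balanced
        elif text[j] == "]":
            depth -= 1
            if depth == 0:
                url, fragment = m.group(1), m.group(2)
                # rule 1: an empty link text falls back to URL or Fragment
                label = text[m.end():j] or url or fragment
                return url, fragment, label, j + 1
    return None  # no matching closing bracket

def parse_email(token):
    """Return a mailto: URL for a token like <localpart@domain>, else None."""
    m = EMAIL.match(token)
    if m is None or "." not in m.group(2):
        return None  # rule 5: the domain needs at least one dot
    return "mailto:%s@%s" % (m.group(1), m.group(2))
```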

The third rule used to be just an exclusion of letters and digits, but in the presence of Unicode, doing that is quite a difficult task. So, the sense has been reversed and a set of allowed characters is enumerated instead. This set includes all white-space and printable characters of the US-ASCII character set except letters and digits.

The processing of a text fragment consists of several steps and is performed in a left-to-right manner:

  1. A valid start character of a special character sequence or a valid prefix of a link up to and including the second "|" is searched for in the text fragment.
  2. If found, a valid end character that corresponds to the start character is searched for in the rest of the text fragment.
  3. If found too, then the text fragment is split into three parts:
    1. The first part in front of the start character is finished.
    2. The enclosed text fragment of the middle part is processed again and the result is finally marked up according to its special meaning.
    3. With the rest of the text fragment following the end character not being empty, the whole procedure starts over again.

Not finding a valid start character or prefix just terminates the processing of a text fragment, while not finding a valid end character causes the start character to be skipped and further processing continues right after it. Note that a text fragment can contain line breaks, which also count as white-space.
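
The scanning procedure can be illustrated with a simplified Python sketch. It handles only the three single-character sequences (_, * and '), applies rules 1 to 3 as read here (rule 3 is checked on the outer side of each delimiter), and leaves out links, e-mail addresses, repeated delimiter characters and HTML escaping. The names markup, find_end and boundary are assumptions.

```python
ALLOWED = set("'`!\"$%&@\\/(){}[|]<>=?*+-~#,.;:_^")
TAGS = {"_": "em", "*": "strong", "'": "code"}

def boundary(text, i):
    """True if position i is outside the text, white-space or an allowed character."""
    return i < 0 or i >= len(text) or text[i].isspace() or text[i] in ALLOWED

def find_end(text, start):
    """Find a valid end character that matches the start character at start."""
    ch = text[start]
    for j in range(start + 2, len(text)):  # enclosed text cannot be empty
        if text[j] == ch and not text[j - 1].isspace() and boundary(text, j + 1):
            return j
    return None

def markup(text):
    """Process a text fragment from left to right and mark up inline elements."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        valid_start = (ch in TAGS and boundary(text, i - 1)
                       and i + 1 < len(text) and not text[i + 1].isspace())
        if valid_start:
            j = find_end(text, i)
            if j is not None:
                # the enclosed fragment is processed again, then marked up
                inner = markup(text[i + 1:j])
                out.append("<%s>%s</%s>" % (TAGS[ch], inner, TAGS[ch]))
                i = j + 1
                continue
        out.append(ch)  # no valid start or end: keep the character as text
        i += 1
    return "".join(out)
```

With this sketch, markup("_emphasis_ and *strong emphasis*") gives "<em>emphasis</em> and <strong>strong emphasis</strong>", while "I'm right, aren't I?", "1*2+3*4" and "HTTP_USER_AGENT" are returned unchanged.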

All characters that immediately follow the start character and that are equal to it are made part of the final enclosed text. Similarly, all characters that immediately precede the end character and that are equal to it are made part of the final enclosed text. Note that these prefixes and suffixes are not processed again in step 3.2. Neither case applies to links. Example:

***A strongly emphasized text that includes two stars on both sides.***

Restrictions

The following restrictions in the processing of HTML documents should be taken into account: