Philipp J. Stolka
Practical Natural Language Processing / Proseminar Künstliche Intelligenz / SS 1998 / Philipp Stolka




3.1 - The Grammar of Formal Languages

Certain rules apply to the structure of a message (a series of symbols with a special meaning).
Of course, the message must be formulated in a language common to both the sender and the hearer; this can be either a formal - invented - or natural language - like English, German or Chinese. Which type is chosen for communication is rather irrelevant for our investigation because both share enough properties to be represented by the same means (at least to a certain extent, as natural languages include lots of ambiguities, metaphors, anaphora etc. that are not allowed in formal ones that much).
Natural languages can be partly represented by special formal languages. The basic parts of such a language are terminal symbols - symbols (tokens) that are final, i.e. words in natural languages. These terminal symbols form phrases - parts of speech that stand for certain grammatical categories like nouns, verbs etc. Noun phrases, for example, describe nouns in detail - "the red herring", "the one I saw" etc., while verb phrases express action, behavior or state - "is dead", "reads quickly" and so on. We will refer to noun phrases by NP and to verb phrases by VP. These and other categories all combine to create a complete sentence S. These groups are called nonterminal symbols, as they don't occur in complete sentences any more. In this context a language is a set of strings composed from terminal symbols according to a series of rules, the grammar.

The terminal symbols (the words) must be present in a lexicon - a list that is subdivided into sections for nouns, verbs, adjectives etc. that includes all allowed words for this language. (Logically, these sections fall into two classes: closed classes, these contain a rigidly fixed set of words (articles, prepositions, and conjunctions); and open classes, their set of words can change over time (nouns, verbs, adjectives and adverbs). This distinction will not bother us here, though.)
The grammar itself provides the framework for building sentences from phrases which in turn are built from terminal symbols. There are a few distinct classes of grammars, which differ in the set of languages they can produce (their generative capacity). Before understanding this, we need to know how new phrases are produced from terminal symbols and other phrases.

The single grammar entries are called rewrite rules because they substitute a part of the first string to obtain another. They have a similar form to this sample rule
S - NP VP
which expresses the fact that to build a full-fledged sentence S you need a noun phrase NP and a verb phrase VP (or vice versa: you can combine a NP and a VP to obtain a S). Of course there are analogous rewrite rules for NP, VP and all other phrase types. There may also be several rewrite rules with the same left-hand side symbol; e.g. in addition to the one above there might exist a
S - NP VP CON NP VP,
where CON stands for a conjunction. This rule could also be expressed as
S - S CON S,
as any occurrence of S can be replaced by NP VP according to our first rule.

Now, you can develop this method into different directions, leading to different classes of grammars. N.Chomsky has stated that there are four of them, sorted by increasing generative capacity here:
First, there are regular grammars, which allow only for constructs like
N - TN,
where N and T stand for one (exactly one) nonterminal or terminal symbol, respectively. This is the least powerful class.
Then, there are context-free grammars (CFGs), in which it is allowed to have rewrite rules like
N - (N|T)*,
that is, one nonterminal symbol can be replaced by any combination of terminal and nonterminal symbols.
This class is succeeded by context-sensitive grammars; here, rules like
(N|T)*1 - (N|T)*2
appear: Any combination of terminals and nonterminals can be replaced by any other combination when the string in scrutiny exactly matches the left-hand side of the rule. However, the right-hand side must contain an equal amount of or more symbols than the left-hand side - the sentence can only be expanded, not shortened:
|(N|T)*1| < |(N|T)*2|.
The most powerful class are recursively enumerable grammars. There are no restrictions as to the left- or right-hand side of the rewrite rules.

prev: 3 - Syntax: How to Put It
this: 3.1 - The Grammar of Formal Languages
next: 3.2 - Parsing And Efficiency

back to main page...

Mail to Philipp Stolka.
Last modified: Tue Sep 22 21:21:55 MEST 1998