| Practical Natural Language Processing / Proseminar Künstliche Intelligenz / SS 1998 / Philipp Stolka |
So far we have assumed that the input is already given as a sequence of words. When raw text has to be processed, however, a few more steps are needed before parsing can begin.
First, the individual tokens have to be identified. Words can usually be told apart by the spaces between them, but in some cases (e.g. at the end of a line) other factors such as hyphens have to be considered. Nevertheless, tokenization is fairly straightforward: problems at this stage are uncommon and can be corrected later in the few cases where they occur.
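A minimal sketch of such a tokenizer in Python is given here. The function name `tokenize` and the rule that a hyphen directly before a line break always marks a split word are assumptions made for illustration; a real system would need to check the rejoined word against a dictionary.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split raw text into word and punctuation tokens."""
    # Rejoin words hyphenated across a line break: "tokeni-\nzation" -> "tokenization".
    # Assumption: a hyphen directly before a newline always marks a split word.
    joined = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", text)
    # A token is a word (optionally with internal hyphens or apostrophes)
    # or a single punctuation character.
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", joined)
```

For example, `tokenize("The tokeni-\nzation step.")` yields the tokens `The`, `tokenization`, `step` and `.`.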
The second step is morphological analysis. At this stage, distinctions such as singular/plural, word forms that arise from a change of word category, and combinations of two different words are accounted for (inflectional morphology, derivational morphology and compounding, respectively).
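As a rough illustration of the inflectional part only, the following Python sketch strips a few English suffixes and keeps an analysis when the remaining stem is known. The rule table `INFLECTION_RULES`, the feature names and the function `analyze` are hypothetical; derivational morphology and compounding are not covered.

```python
# Hypothetical suffix rules: (suffix, replacement, feature).
INFLECTION_RULES = [
    ("ies", "y", "plural"),
    ("es", "", "plural"),
    ("s", "", "plural"),
    ("ed", "", "past"),
    ("ing", "", "progressive"),
]

def analyze(word: str, stems: set[str]) -> list[tuple[str, str]]:
    """Return every (stem, feature) analysis whose stem is in the known stem set."""
    analyses = []
    if word in stems:
        analyses.append((word, "base"))
    for suffix, replacement, feature in INFLECTION_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            # Remove the suffix and restore the stem ending, e.g. "cities" -> "city".
            stem = word[: len(word) - len(suffix)] + replacement
            if stem in stems:
                analyses.append((stem, feature))
    return analyses
```

An empty result means that none of the rules led to a known stem, so the word has to be treated as unknown.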
Then comes dictionary lookup. Here, the tokens found are looked up in the dictionary and their entry (their grammatical category) is returned, as it is needed for later parsing.
A word might not appear in the dictionary, though. In that case, several things can be done during error recovery. The morphological analyzer can try to work out which category the word could belong to, based on suffixes, capitalization or format. Alternatively, you might decide that it is a spelling mistake and try to find the word that is most probably the intended one in this context. This can be done either with character-based models, which search the word space for tokens that are close to the examined word in terms of typing errors (doubled letters, swaps, omissions etc.), or with sound-based models, which convert the word into a phonetic transcription and search for words that sound the same.
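Two of these recovery strategies can be sketched in Python: guessing a category from the surface form, and a character-based search for dictionary words that are one typing error away. All names (`guess_category`, `edits1`, `spelling_candidates`), the suffix rules and the category labels are assumptions for illustration. The sketch covers neither sound-based models nor the ranking of candidates by context.

```python
import string

def guess_category(word: str) -> str:
    """Guess a category from format, capitalization and suffix (illustrative rules)."""
    if word.replace(".", "", 1).replace(",", "").isdigit():
        return "Number"
    if word[:1].isupper():
        return "ProperNoun"
    if word.endswith("ly"):
        return "Adverb"
    if word.endswith(("tion", "ness", "ment")):
        return "Noun"
    return "Unknown"

def edits1(word: str) -> set[str]:
    """All strings that differ from word by exactly one typing operation."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}            # undoes a doubled or extra letter
    swaps = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}  # undoes a swap
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}  # undoes a wrong key
    inserts = {a + c + b for a, b in splits for c in letters}  # undoes an omission
    return deletes | swaps | replaces | inserts

def spelling_candidates(word: str, vocabulary: set[str]) -> set[str]:
    """Dictionary words that lie one typing operation away from the unknown word."""
    return edits1(word.lower()) & vocabulary
```

With a vocabulary containing `parser` and `word`, the misspellings `parrser` and `wrod` each lead back to a single candidate. When several candidates remain, a further step would have to pick the one that fits the context best.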
| this: | 3.3 - Unknown Words |