| Practical Natural Language Processing / Proseminar Künstliche Intelligenz / SS 1998 / Philipp Stolka |
Whenever you are given a text, you need to extract information from it and make assumptions about the received data. This is a task that calls for automated processing. [AI23] points out how information storage has shifted from databases to text, and what difficulties arise from that shift.
Nowadays, instead of using structured constructs to query a given database, you first need to find out where the data in question is stored and then deduce the relevant facts from it. While searching for documents, you can hardly foresee where and what information will turn up, so you need some flexibility in posing your questions. Mostly you are interested in texts concerning a specific topic, so you can classify all documents in scope according to criteria such as word occurrence. This is the task of information retrieval, text categorization and data extraction applications.
From a given set of documents, information retrieval strives to find a subset that corresponds best to the query. In one way or another, such a program searches all documents for occurrences of the user-provided keywords and returns the relevant ones.
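As a minimal sketch of this idea, the following Python function scans every document for the user-provided keywords and returns the ones that contain at least one of them. The function name `keyword_search`, the whitespace tokenization and the sample documents are assumptions made for illustration only; a real system would tokenize and rank far more carefully.

```python
def keyword_search(documents, keywords):
    """Return the ids of all documents containing at least one keyword."""
    query = {k.lower() for k in keywords}
    hits = []
    for doc_id, text in documents.items():
        tokens = set(text.lower().split())  # naive whitespace tokenization
        if tokens & query:                  # any keyword occurs in the text
            hits.append(doc_id)
    return hits


# Hypothetical sample collection.
docs = {
    "d1": "moon rock samples contain basalt",
    "d2": "stock prices fell on monday",
    "d3": "basalt is a volcanic rock",
}
print(keyword_search(docs, ["basalt"]))  # ['d1', 'd3']
```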
In the early days of IR, documents that existed on paper were stored in machine-readable form and manually indexed. Performing the indexing in the computer's memory allows for much larger and more precise indexes, but is also more time-consuming and therefore expensive. It would thus be preferable to skip indexing altogether and operate not on pre-made indexes but on the whole texts. This is the idea behind free-text searching, which soon became popular.
Nevertheless, critics argued that free-text search might turn out to be less successful than indexed searching. Indexing does not merely take words from the document and put them in a structured database; it also involves a judgment about the quality of the selected keywords. Not all words play equally important roles in a text - e.g., conjunctions such as "and" and "or" or articles like "a" are so ubiquitous that using them as search keywords would be very inappropriate, as nearly all texts include them.
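The selection step described above can be sketched as an index that maps each word to the documents containing it, while skipping ubiquitous words. The stop-word list and the name `build_index` are illustrative assumptions, not a standard list or an established API.

```python
from collections import defaultdict

# Illustrative stop-word list (an assumption, not a standard one).
STOP_WORDS = {"a", "an", "and", "or", "the", "is", "of"}


def build_index(documents):
    """Map each non-stop-word token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            if token not in STOP_WORDS:  # ubiquitous words make poor keywords
                index[token].add(doc_id)
    return index


# Hypothetical sample collection.
docs = {
    "d1": "the moon and a rock",
    "d2": "a rock or a stone",
}
index = build_index(docs)
print(sorted(index["rock"]))  # ['d1', 'd2']
print("and" in index)         # False
```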
So how can we decide if the free-text search method is acceptable or not? This question led to the definition of recall and precision in IR, pushed mainly by Cyril Cleverdon of the College of Aeronautics at Cranfield. Recall is the fraction of relevant documents that are retrieved, and precision is the fraction of retrieved documents that are relevant (see [MLES]). In practice, these two quantities tend to trade off against each other: if you want to increase recall, you usually have to accept lower precision, and vice versa.
Eventually it turned out that automatic free-text search performs no worse than manual indexing, and consequently this information retrieval technique attracted intensified research.
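The two definitions translate directly into code. The sketch below computes both measures for a narrow and a broad result list; the document ids and the choice of which documents count as relevant are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth and two result lists.
relevant = {"d1", "d2", "d3", "d4"}
narrow = ["d1", "d2"]
broad = ["d1", "d2", "d3", "d5", "d6", "d7"]

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 0.75)
```

In this toy example the broader result list raises recall from 0.5 to 0.75 but lowers precision from 1.0 to 0.5, which illustrates the trade-off between the two measures.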
One significant performance boost was achieved by relevance feedback, which takes words from retrieved documents that are known to be relevant and appends them to the keyword string. This provides a way to find relevant documents with increasing precision.
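A very simple form of relevance feedback might look like the following sketch. It assumes that the most frequent new words in the documents judged relevant are the ones worth appending; the function `expand_query`, its parameter `n_new` and the sample texts are hypothetical, and real systems use more careful term selection and weighting.

```python
from collections import Counter


def expand_query(keywords, relevant_texts, n_new=2, stop_words=frozenset()):
    """Append the n_new most frequent new words from relevant texts to the query."""
    query = [k.lower() for k in keywords]
    counts = Counter()
    for text in relevant_texts:
        for token in text.lower().split():
            if token not in query and token not in stop_words:
                counts[token] += 1
    new_terms = [term for term, _ in counts.most_common(n_new)]
    return query + new_terms


# Hypothetical documents the user has marked as relevant.
relevant_texts = [
    "lunar basalt samples",
    "basalt samples from the moon",
    "basalt composition",
]
print(expand_query(["lunar"], relevant_texts))  # ['lunar', 'basalt', 'samples']
```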
In the 1970s, with the advent of large-scale electronic data processing, larger amounts of data became available online, and information retrieval techniques became more and more important. At the same time, vector-space models were invented. With them, the earlier error-prone use of Boolean combinations of keywords using "AND" and "OR" for searching could be replaced. Instead, the text is represented as an n-dimensional vector, with each dimension standing for one particular token from the text. This vector is compared to the keyword vector, and when its similarity to the keyword vector exceeds a certain threshold, the corresponding text becomes a member of the relevant document list. This model can be modified in many ways, e.g. by assigning a token a larger weight if it is a good discriminator (probabilistic information retrieval).
These models plainly leave out any syntactic or semantic aspects; they work "almost entirely at word level". The artificial intelligence community tried to prove that analyzing the text with more sophisticated natural language processing methods can give better results (although they did fewer real-world experiments than the IR people did), but they failed to show the superiority of automatic intellectual analysis approaches.
Some applications, however, performed quite well - e.g. the aforementioned LUNAR system and similar projects led by Roger Schank at Yale, which tried to recognize certain syntactic structures and fill templates with representations of what is spoken about. This in turn resulted in more research into appropriate knowledge representation languages, logical languages that can formulate general world knowledge to be used in expert systems. But, working on structured databases rather than free-form text files, these approaches were restricted to very specific domains.
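The vector comparison can be sketched as follows, assuming raw token counts as vector components and cosine similarity as the similarity measure. The text above names neither, so both are illustrative choices, as are the function names, the threshold value and the sample documents. Discriminator weighting is left out to keep the example short.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of token counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(documents, query, threshold=0.3):
    """Return (doc_id, score) pairs at or above the threshold, best first."""
    q = to_vector(query)
    scored = [(doc_id, cosine(to_vector(text), q))
              for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(doc_id, score) for doc_id, score in scored if score >= threshold]


# Hypothetical sample collection.
docs = {
    "d1": "moon rock basalt",
    "d2": "stock prices fell",
    "d3": "basalt rock",
}
results = retrieve(docs, "basalt rock")
print([doc_id for doc_id, _ in results])  # ['d3', 'd1']
```

Unlike a Boolean query, this yields a ranking: "d3" matches the query exactly and scores highest, "d1" shares two of its three tokens with the query, and "d2" shares none and falls below the threshold.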
However, this may be a favorable restriction: In the news service market, several providers perform automatic text categorization ("sorting texts into fixed categories"). Where this task used to involve a large amount of human work, NLP systems can reliably take over and minimize both costs and inconsistencies while maximizing speed.
A similar task is data extraction. Here, you are given a text; from this you try to extract whatever information it contains and put it in a database available for later querying. Mostly NLP techniques are used for that, although IR might give you a hint in which direction to go.
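To give a rough impression of data extraction, the sketch below fills a fixed template (company, person, role) from sentences that follow one hard-coded pattern and collects the results as records that could be stored in a database. The regular expression, the template fields and the sample text are all assumptions made for illustration; actual systems rely on syntactic analysis rather than a single pattern.

```python
import re

# Hypothetical template: "<Company> appointed <First Last> as <role>."
PATTERN = re.compile(
    r"(?P<company>[A-Z]\w+) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as "
    r"(?P<role>[\w ]+?)\."
)


def extract_records(text):
    """Return one dict per sentence that matches the appointment template."""
    return [match.groupdict() for match in PATTERN.finditer(text)]


# Made-up sample text.
text = ("Acme appointed Jane Doe as chief engineer. Shares rose. "
        "Initech appointed John Roe as director.")
for record in extract_records(text):
    print(record)
```

Sentences that do not fit the template, such as "Shares rose.", are simply ignored, which shows how closely such an approach is tied to one narrow domain.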
| this: | 2.3 - Dealing with text |