| Text databases are becoming larger and larger, the best example being the World Wide Web (or just Web). For this reason, the importance of information retrieval (IR) and related topics, such as text mining, is increasing every day [Baeza-Yates & Ribeiro-Neto, 1999]. However, doing experiments on large text collections is not easy, unless the Web itself is used. In fact, although reference collections such as TREC [Harman, 1995] are very useful, their size is several orders of magnitude smaller than that of large databases. Therefore, scaling is an important issue. One partial solution to this problem is to have good models of text databases, so that new indices and searching algorithms can be analyzed before making the effort of trying them at a large scale, in particular if our application is searching the Web. The goals of this article are twofold: (1) to present in an integrated manner many different results on how to model natural language text and document collections, and (2) to show their relations, consequences, advantages, and drawbacks.
We can distinguish three types of models: (1) models for static databases, (2) models for dynamic databases, and (3) models for queries and their answers. Models for static databases are the classical ones for natural language text. They are based on empirical evidence and include the number of different words or vocabulary (Heaps' law), word distribution (Zipf's law), word length, distribution of document sizes, and distribution of words in documents. We formally relate Heaps' and Zipf's empirical laws and show that they can be explained by a simple finite state model.
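The connection between the two empirical laws can be illustrated with a small simulation. The sketch below (not taken from the article; all parameter values are assumptions chosen for the demo) draws words from a Zipf-like rank-frequency distribution, where the word of rank r has probability proportional to 1/r^theta, and then measures how the vocabulary grows with the text size, which empirically follows Heaps' law V(n) ~ K * n^beta with 0 < beta < 1.

```python
import math
import random
from itertools import accumulate

def estimate_heaps_beta(n_words=50_000, vocab_size=10_000, theta=1.5, seed=42):
    """Sample a synthetic text from a Zipf-like distribution and estimate
    the Heaps'-law exponent beta from the vocabulary growth.
    All parameter values here are illustrative assumptions."""
    rng = random.Random(seed)
    # Zipf's law: probability of the word of rank r is proportional to 1/r^theta.
    weights = [1.0 / r ** theta for r in range(1, vocab_size + 1)]
    cum = list(accumulate(weights))
    words = rng.choices(range(1, vocab_size + 1), cum_weights=cum, k=n_words)

    # Track vocabulary size V(n) at two checkpoints.
    seen = set()
    checkpoints = {}
    for n, w in enumerate(words, start=1):
        seen.add(w)
        if n in (1_000, n_words):
            checkpoints[n] = len(seen)

    # Heaps' law: V = K * n^beta, so from two measurements
    # beta = log(V2/V1) / log(n2/n1).
    n1, n2 = 1_000, n_words
    return math.log(checkpoints[n2] / checkpoints[n1]) / math.log(n2 / n1)

if __name__ == "__main__":
    beta = estimate_heaps_beta()
    print(f"estimated Heaps' exponent beta = {beta:.2f}")
```

Running this with different values of theta shows how the steepness of the word distribution drives the sublinear growth of the vocabulary, which is the kind of relation between the two laws that the article develops formally.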
Dynamic databases can be handled by extensions of static models, but several additional issues have to be considered. The models for queries and their answers have not been formally developed until now. What are the correct assumptions? What is a random query? How many occurrences of a query are found? We propose specific models to answer these questions. |