| Text databases are becoming larger and larger, the best example being the World Wide Web (or just Web). For this reason, the importance of information retrieval (IR) and related topics, such as text mining, is increasing every day [Baeza-Yates & Ribeiro-Neto, 1999]. However, doing experiments on large text collections is not easy, unless the Web itself is used. In fact, although reference collections such as TREC [Harman, 1995] are very useful, their size is several orders of magnitude smaller than that of large databases. Therefore, scaling is an important issue. One partial solution to this problem is to have good models of text databases, so that new indices and searching algorithms can be analyzed before making the effort of trying them at a large scale, in particular if our application is searching the Web. The goals of this article are twofold: (1) to present in an integrated manner many different results on how to model natural language text and document collections, and (2) to show their relations, consequences, advantages, and drawbacks.
We can distinguish three types of models: (1) models for static databases, (2) models for dynamic databases, and (3) models for queries and their answers. Models for static databases are the classical ones for natural language text. They are based on empirical evidence and include the number of different words or vocabulary (Heaps' law), word distribution (Zipf's law), word length, distribution of document sizes, and distribution of words in documents. We formally relate Heaps' and Zipf's empirical laws and show that they can be explained by a simple finite state model.
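The connection between the two empirical laws can be illustrated with a small simulation. The sketch below (not taken from the article; all parameter values are assumptions chosen for the demo) draws words from a Zipf-like rank-frequency distribution, where the word of rank r has probability proportional to 1/r^theta, and then measures how the vocabulary grows with the text size, which empirically follows Heaps' law V(n) ~ K * n^beta with 0 < beta < 1.

```python
import math
import random
from itertools import accumulate

def estimate_heaps_beta(n_words=50_000, vocab_size=10_000, theta=1.5, seed=42):
    """Sample a synthetic text from a Zipf-like distribution and estimate
    the Heaps'-law exponent beta from the vocabulary growth.
    All parameter values here are illustrative assumptions."""
    rng = random.Random(seed)
    # Zipf's law: probability of the word of rank r is proportional to 1/r^theta.
    weights = [1.0 / r ** theta for r in range(1, vocab_size + 1)]
    cum = list(accumulate(weights))
    words = rng.choices(range(1, vocab_size + 1), cum_weights=cum, k=n_words)

    # Track vocabulary size V(n) at two checkpoints.
    seen = set()
    checkpoints = {}
    for n, w in enumerate(words, start=1):
        seen.add(w)
        if n in (1_000, n_words):
            checkpoints[n] = len(seen)

    # Heaps' law: V = K * n^beta, so from two measurements
    # beta = log(V2/V1) / log(n2/n1).
    n1, n2 = 1_000, n_words
    return math.log(checkpoints[n2] / checkpoints[n1]) / math.log(n2 / n1)

if __name__ == "__main__":
    beta = estimate_heaps_beta()
    print(f"estimated Heaps' exponent beta = {beta:.2f}")
```

Running this with different values of theta shows how the steepness of the word distribution drives the sublinear growth of the vocabulary, which is the kind of relation between the two laws that the article develops formally.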
Dynamic databases can be handled by extensions of static models, but several additional issues have to be considered. The models for queries and their answers have not been formally developed until now. What are the correct assumptions? What is a random query? How many occurrences of a query are found? We propose specific models to answer these questions. |