Similarity-based matching is widely used in the vector space model. However, its wider adoption is hampered by disagreements over how similarity measures should be constructed and how large databases should be indexed so that similarity matching is feasible at all. This thesis aims to overcome these obstacles and to establish a theoretical basis and implementation guidelines for applying similarity-based matching in Web retrieval.

The thesis analyzes the vector space model and shows that Web space is modeled more accurately as a curved space than as a Euclidean space. On this basis, the thesis argues that it is inappropriate to apply a single similarity/dissimilarity measure globally on Web space. It proposes a Riemann space model that explains previously unexplained phenomena. In the Riemann space model, dissimilarity functions are unified as geodesic distances, which can be computed locally with a uniform formula. To some extent, this answers the long-standing open problem of identifying the conditions under which a particular similarity/dissimilarity measure should be used.

Following the theory of the Riemann space model, we propose a multi-stage approach that combines exact matching and partial matching in the design of new Web retrieval systems. In this approach, a retrieval system first forms a neighborhood of a query, which can be done using exact matching. Then, within the chosen neighborhood, more complex similarity-based matching is performed. The documents are ranked according to their geodesic distances to the query. This is equivalent to using a ranking function specially designed for the given neighborhood. Because similarity-based matching is performed only within a neighborhood, the computational cost of the search process is reduced.
The Riemann space model provides a sound theoretical basis for this multi-stage approach.

As a demonstration, we designed and implemented a personal Web retrieval (PWR) system. Unlike current search engines, subject trees, and metasearch engines, this system is a client-side program. It works like a personal secretary: it reads Web documents, ranks them according to their geodesic distances to the query, and also takes the user's general search interests into account. It can be viewed as a prototype of intelligent Web retrieval systems.
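
The multi-stage approach can be illustrated with a minimal Python sketch. It is not the PWR implementation, and the abstract does not specify the thesis's actual metric. As a stand-in, the sketch assumes documents are term-frequency vectors normalized onto the unit sphere, where the geodesic distance between two points is the arc length, that is, the arccosine of their cosine similarity. The function names (`tokenize`, `unit_vector`, `geodesic_distance`, `retrieve`) and the whitespace tokenizer are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase whitespace tokenizer, for illustration only.
    return text.lower().split()


def unit_vector(tokens):
    # Term-frequency vector scaled to unit length (a point on the unit sphere).
    counts = Counter(tokens)
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def geodesic_distance(u, v):
    # Assumption: on the unit sphere the geodesic distance is the arc length,
    # arccos(u . v). This stands in for the thesis's metric, which is not given.
    cos = sum(w * v.get(t, 0.0) for t, w in u.items())
    cos = max(-1.0, min(1.0, cos))  # guard against floating-point overshoot
    return math.acos(cos)


def retrieve(query, documents):
    q_tokens = tokenize(query)
    q_terms = set(q_tokens)
    q_vec = unit_vector(q_tokens)

    # Stage 1: exact matching forms the neighborhood of the query
    # (here: documents sharing at least one query term).
    neighborhood = [d for d in documents if q_terms & set(tokenize(d))]

    # Stage 2: similarity-based matching only inside the neighborhood,
    # ranking by geodesic distance to the query (smallest first).
    scored = [(geodesic_distance(q_vec, unit_vector(tokenize(d))), d)
              for d in neighborhood]
    return sorted(scored)


if __name__ == "__main__":
    docs = [
        "riemann space model",
        "web retrieval with riemann geometry of web space",
        "cooking recipes",
    ]
    for dist, doc in retrieve("riemann space", docs):
        print(f"{dist:.3f}  {doc}")
```

Because the distance is evaluated only for documents that survive the exact-matching stage, the more expensive computation scales with the size of the neighborhood rather than with the whole collection.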