The primary purpose of an information retrieval system is to retrieve all documents relevant to the user's query. The Latent Semantic Indexing (LSI) based ad hoc document retrieval task investigates the performance of retrieval systems that search a static set of documents using new questions or queries. The performance of LSI has been tested on several smaller datasets (e.g., the MED and CISI abstracts); however, LSI has not been tested on a large dataset. In this research, we concentrated on the performance of LSI on a large dataset. Stop word lists and term weighting schemes are two key parameters in information retrieval. We investigated the performance of LSI using three different stop word lists and also without removing stop words from the test collection. We also applied three different term weighting schemes (raw term frequency, log-entropy, and tf-idf) to measure the retrieval performance of LSI. We observed, first, that for an LSI based ad hoc information retrieval system, a tailored stop word list must be assembled for every unique large dataset. Second, the tf-idf term weighting scheme shows better retrieval performance than the log-entropy and raw term frequency weighting schemes, even when the test collection becomes large. --P. ii.
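As a rough illustration of the three term weighting schemes named above, the following minimal sketch applies each one to a toy term-document count matrix, the step that would precede the singular value decomposition in LSI. The abstract does not give the exact formulas used in the thesis, so the common textbook forms are assumed here: tf-idf as tf * ln(N / df), and log-entropy as ln(1 + tf) scaled by a global weight of 1 + sum(p ln p) / ln N. The function names and the toy counts are hypothetical and are not taken from the thesis.

```python
import math

def raw_tf(counts):
    """Raw term frequency: the weight is the count itself."""
    return [[float(c) for c in row] for row in counts]

def tf_idf(counts):
    """tf-idf, assuming the form tf * ln(N / df)."""
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        df = sum(1 for c in row if c > 0)  # documents containing the term
        idf = math.log(n_docs / df) if df else 0.0
        weighted.append([c * idf for c in row])
    return weighted

def log_entropy(counts):
    """Log-entropy, assuming ln(1 + tf) * (1 + sum(p ln p) / ln N).

    Requires at least two documents so that ln N is non-zero.
    """
    n_docs = len(counts[0])
    weighted = []
    for row in counts:
        gf = sum(row)  # total occurrences of the term in the collection
        entropy = 0.0
        for c in row:
            if c > 0:
                p = c / gf
                entropy += p * math.log(p)
        global_w = 1.0 + entropy / math.log(n_docs)
        weighted.append([math.log(1 + c) * global_w for c in row])
    return weighted

# Rows are terms, columns are documents (toy counts, not thesis data).
counts = [
    [2, 0, 1, 0],  # "retrieval"
    [1, 1, 0, 0],  # "index"
    [0, 3, 0, 1],  # "matrix"
    [1, 1, 1, 1],  # "the": occurs once in every document
    [0, 0, 0, 2],  # "svd": occurs in a single document
]

for name, scheme in [("raw", raw_tf), ("tf-idf", tf_idf), ("log-entropy", log_entropy)]:
    print(name, [round(w, 3) for w in scheme(counts)[3]])
```

Under these assumed formulas, a term spread evenly over every document (the "the" row) receives zero weight from both tf-idf and log-entropy, while raw term frequency leaves it untouched. This suggests one way the choice of weighting scheme can interact with the choice of stop word list.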