
Improvement to Chinese information retrieval by incorporating word segmentation and query expansion


Abstract

The increasing diversity of the Internet has created a vast number of multilingual resources on the Web, and a large share of these documents are written in languages other than English. Consequently, the demand for searching in non-English languages is growing exponentially, and it is desirable for a search engine to be able to search collections of documents in other languages. This research investigates techniques for developing high-quality Chinese information retrieval systems. A distinctive feature of Chinese text is that a document is a sequence of Chinese characters with no space or other boundary between words. This makes Chinese information retrieval more difficult in two ways. First, a retrieved document that contains the query term as a sequence of Chinese characters may not actually be relevant, because that character sequence may not form a valid Chinese word in that document. Second, a document that is actually relevant may not be retrieved, because it does not contain the query sequence but contains other relevant words. In this research, we propose two approaches to deal with these problems. In the first approach, we propose a hybrid Chinese information retrieval model that incorporates word-based techniques into the traditional character-based techniques. The aim of this approach is to investigate the influence of Chinese word segmentation on the performance of Chinese information retrieval. Two ranking methods are proposed that rank retrieved documents by their relevance to the query, calculated by combining character-based ranking and word-based ranking. Our experimental results show that Chinese word segmentation can improve the performance of Chinese information retrieval, but the improvement is not significant when segmentation alone is combined with the traditional character-based approach.
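The word-boundary problem and the idea of combining the two rankings can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the toy dictionary `VOCAB`, the backward maximum matching segmenter `bmm_segment`, the binary match scores, and the linear interpolation weight `alpha`. The abstract does not specify the two ranking methods actually proposed.

```python
def bmm_segment(text, vocab, max_len=4):
    """Toy backward maximum matching: scan from the end of the text and
    greedily take the longest dictionary word (single characters always match)."""
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in vocab:
                words.append(piece)
                end -= size
                break
    words.reverse()
    return words


def char_score(query, doc):
    """1.0 if the query occurs as a raw character sequence, else 0.0."""
    return 1.0 if query in doc else 0.0


def word_score(query, doc, vocab):
    """1.0 if the query occurs as a whole word after segmentation, else 0.0."""
    return 1.0 if query in bmm_segment(doc, vocab) else 0.0


def hybrid_score(query, doc, vocab, alpha=0.5):
    """Hypothetical combination: linear interpolation of the two scores."""
    return alpha * char_score(query, doc) + (1 - alpha) * word_score(query, doc, vocab)


# Hypothetical toy dictionary and documents.
VOCAB = {"产品", "服务", "和服", "和", "她", "穿"}
DOCS = ["产品和服务", "她穿和服"]  # "products and services", "she wears a kimono"

if __name__ == "__main__":
    for doc in DOCS:
        print(doc, bmm_segment(doc, VOCAB), hybrid_score("和服", doc, VOCAB))
```

For the query 和服 (kimono), both documents match at the character level, but only the second contains it as a word, so under these assumptions the hybrid score ranks the second document higher.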
In the second approach, we propose a novel query expansion method that applies text mining techniques to find the most relevant words with which to extend the query. Most existing query expansion methods select highly frequent indexing terms from the retrieved documents to expand the query. In contrast, our approach uses text mining techniques to find patterns in the retrieved documents that correlate highly with the query term, and then uses the relevant words in those patterns to expand the original query. This research project develops and implements a Chinese information retrieval system for evaluating the proposed approaches. The experiments have two stages. The first stage investigates whether high-accuracy segmentation can improve Chinese information retrieval. In the second stage, the text mining based query expansion approach is implemented, and a further experiment compares the performance of the standard Rocchio approach with that of the proposed text mining based query expansion method. The NTCIR5 Chinese collections are used in the experiments. The experimental results show that incorporating the text mining based query expansion into the hybrid model achieves a significant improvement in both precision and recall.
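For orientation, the Rocchio baseline mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system built in the thesis: it uses raw term counts as weights, applies only the positive-feedback part of the Rocchio formula (no non-relevant set), and treats the top retrieved documents as relevant. The function name `rocchio_expand` and the parameters `alpha`, `beta`, and `top_k` are hypothetical.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, top_k=2):
    """Positive-feedback Rocchio: new_q = alpha * q + beta * centroid(feedback_docs).

    query_terms   -- list of query words
    feedback_docs -- list of documents, each a list of (already segmented) words
    Returns the expanded term list and the re-weighted query vector.
    """
    query_vec = Counter(query_terms)
    new_vec = {term: alpha * weight for term, weight in query_vec.items()}
    if not feedback_docs:
        return list(query_terms), new_vec

    # Sum term counts over the feedback documents, then average them.
    totals = Counter()
    for doc in feedback_docs:
        totals.update(doc)
    for term, total in totals.items():
        new_vec[term] = new_vec.get(term, 0.0) + beta * total / len(feedback_docs)

    # Expansion candidates are the highest-weighted terms not already in the query.
    candidates = [term for term in new_vec if term not in query_vec]
    candidates.sort(key=lambda term: (-new_vec[term], term))
    return list(query_terms) + candidates[:top_k], new_vec


if __name__ == "__main__":
    # Hypothetical segmented feedback documents.
    docs = [["信息", "检索", "系统"], ["信息", "检索", "模型"], ["中文", "信息", "检索"]]
    expanded, weights = rocchio_expand(["检索"], docs, top_k=1)
    print(expanded, weights)
```

This kind of baseline favours terms that are simply frequent in the feedback documents, which is the behaviour the proposed pattern-based method is contrasted with.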

Bibliographic record

  • Author

    Li Zhihan

  • Author affiliation
  • Year 2009
  • Total pages
  • Original format PDF
  • Language of text English
  • CLC classification
