
Query Based Chinese Phrase Extraction for Site Search



Abstract

Word segmentation (WS) is one of the major issues of information processing in character-based languages, because these languages have no explicit word boundaries. Moreover, the minimum meaningful unit is often a phrase, that is, a combination of several consecutive words. Although much work has been done on WS, little has been explored in site search on mining site-specific knowledge from user query logs to obtain both more accurate WS and better retrieval performance. This paper proposes a novel, statistics-based method that extracts phrases from user query logs. The extracted phrases, combined with a general, static dictionary, form a dynamic, site-specific dictionary. Using this dictionary, web documents are segmented into phrases and words, which are kept as separate index terms to build a phrase-enhanced index for site search. Experimental results show that the approach greatly improves retrieval performance. It also helps to detect many out-of-vocabulary words, such as site-specific phrases, newly created words, and names of people and locations, which are difficult to handle with a general, static dictionary.
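
The pipeline outlined in the abstract (mine phrases from the query log, merge them with a static dictionary, segment documents, index phrases and words) can be illustrated with a minimal Python sketch. The abstract does not state which statistic the paper uses, so everything concrete here is an assumption made for illustration: the raw-frequency threshold `min_count`, the use of forward maximum matching for segmentation, the choice to index a phrase alongside its component words, and the hypothetical function names `extract_phrases`, `segment`, and `index_terms`. It is not the authors' method.

```python
from collections import Counter


def extract_phrases(query_log, static_dict, min_count=2):
    """Assumed statistic: a whole query that recurs at least `min_count`
    times and is not already a dictionary word is kept as a phrase."""
    counts = Counter(q.strip() for q in query_log if q.strip())
    return {
        q for q, c in counts.items()
        if c >= min_count and len(q) > 1 and q not in static_dict
    }


def segment(text, dictionary):
    """Forward maximum matching: at each position take the longest
    dictionary entry, falling back to a single character."""
    max_len = max((len(w) for w in dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def index_terms(text, static_dict, phrases):
    """Segment with the merged site-specific dictionary, then emit each
    phrase together with its component words as separate index terms
    (an assumed reading of 'phrase enhanced index')."""
    site_dict = static_dict | phrases
    terms = []
    for tok in segment(text, site_dict):
        terms.append(tok)
        if tok in phrases:
            terms.extend(segment(tok, static_dict))
    return terms


# Toy data, invented for this example.
demo_dict = {"北京", "大学", "图书馆"}
demo_log = ["北京大学", "北京大学", "图书馆", "北京大学图书馆"]
demo_phrases = extract_phrases(demo_log, demo_dict)
print(index_terms("北京大学图书馆", demo_dict, demo_phrases))
# prints ['北京大学', '北京', '大学', '图书馆']
```

In this toy run the repeated query 北京大学 becomes a site-specific phrase, so a document containing it yields both the phrase and its component words as index terms. A realistic system would likely replace the frequency threshold with a stronger association measure and also consider sub-strings of queries.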
