Conference: Nordic Conference of Computational Linguistics
An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Abstract

We present our work towards developing a system that is intended to find, in a large text corpus, contiguous phrases whose meaning is similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and combines a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different orders. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that distributional semantic models trained with uni-, bi- and trigrams seem to work better than a more traditional unigram model.
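To make the central idea concrete, here is a minimal sketch of looking up similar n-grams of different orders in one shared vector space. It is not the authors' implementation: the `EMBEDDINGS` table, its vectors, and the `nearest_ngrams` helper are all hypothetical, and the language-model and search-engine components mentioned in the abstract are left out.

```python
import math

# Hypothetical toy embeddings: uni-, bi- and trigrams share one vector space.
# The vectors are invented for illustration, not taken from any trained model.
EMBEDDINGS = {
    ("heart",): [0.9, 0.1, 0.0],
    ("attack",): [0.1, 0.9, 0.0],
    ("pain",): [0.0, 0.2, 0.9],
    ("heart", "attack"): [0.8, 0.2, 0.6],
    ("cardiac", "arrest"): [0.7, 0.1, 0.5],
    ("myocardial", "infarction"): [0.78, 0.22, 0.62],
    ("acute", "myocardial", "infarction"): [0.75, 0.25, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_ngrams(ngram, k=3):
    """Return the k most similar n-grams of any order, as (ngram, score) pairs."""
    query = EMBEDDINGS[ngram]
    scored = [
        (cosine(query, vec), other)
        for other, vec in EMBEDDINGS.items()
        if other != ngram
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [(other, score) for score, other in scored[:k]]


if __name__ == "__main__":
    # A bigram query can retrieve both bigram and trigram neighbours.
    for neighbour, score in nearest_ngrams(("heart", "attack")):
        print(" ".join(neighbour), round(score, 3))
```

Because all n-gram orders live in the same space, a bigram query can be matched against a trigram directly, which is what allows a rewritten phrase to differ in length from the query. In a fuller pipeline of the kind the abstract describes, candidates produced this way would presumably still need to be ranked by a language model and checked against the corpus.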