Nordic Conference of Computational Linguistics

An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Abstract

We present our work towards developing a system that finds, in a large text corpus, contiguous phrases that express a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach is generative and unsupervised, and combines a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different orders. As data, we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that distributional semantic models trained with uni-, bi- and trigrams seem to work better than a more traditional unigram model.
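The generative idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embedding values, the toy corpus, the greedy segmentation, the similarity threshold and every function name below are assumptions made for illustration. The sketch segments a query into known n-grams (up to trigrams), swaps each segment for its nearest neighbours in a shared n-gram vector space, and keeps only candidates that occur in the corpus. A plain substring match stands in for the document search engine, and the statistical language model used for ranking is omitted.

```python
import math
from itertools import product

# Hypothetical toy n-gram embeddings; a real model would be trained on a corpus.
EMBEDDINGS = {
    "heart attack": [0.9, 0.1, 0.0],
    "myocardial infarction": [0.88, 0.12, 0.02],
    "cardiac arrest": [0.7, 0.3, 0.1],
    "risk": [0.1, 0.9, 0.0],
    "likelihood": [0.12, 0.85, 0.05],
    "of": [0.0, 0.0, 1.0],
}

# Hypothetical toy corpus standing in for the indexed document collection.
CORPUS = [
    "the risk of myocardial infarction increased with age",
    "likelihood of heart attack was reduced",
]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbours(ngram, k=1, min_sim=0.9):
    """Return the n-gram itself plus up to k neighbours with similarity >= min_sim.

    Neighbours may be n-grams of a different order (unigram <-> bigram).
    """
    vec = EMBEDDINGS[ngram]
    scored = [(cosine(vec, v), g) for g, v in EMBEDDINGS.items() if g != ngram]
    scored = sorted((s for s in scored if s[0] >= min_sim), reverse=True)
    return [ngram] + [g for _, g in scored[:k]]


def segment(tokens, max_n=3):
    """Greedy left-to-right segmentation that prefers the longest known n-gram."""
    i, out = 0, []
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = " ".join(tokens[i:i + n])
            if gram in EMBEDDINGS or n == 1:  # unknown single words pass through
                out.append(gram)
                i += n
                break
    return out


def candidate_rewrites(query, k=1):
    """Generate rewrites by substituting each segment with its neighbours."""
    options = [neighbours(s, k) if s in EMBEDDINGS else [s]
               for s in segment(query.split())]
    return [" ".join(choice) for choice in product(*options)]


def found_in_corpus(candidates):
    """Keep candidates that occur verbatim in at least one corpus document."""
    return [c for c in candidates if any(c in doc for doc in CORPUS)]


if __name__ == "__main__":
    candidates = candidate_rewrites("risk of heart attack")
    print(candidates)
    print(found_in_corpus(candidates))
```

Because bigrams such as "heart attack" have their own vectors, a multi-word segment can be replaced as a unit by another multi-word phrase. A unigram-only model would have to rewrite "heart" and "attack" separately.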
