An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora

Abstract

We present our work towards developing a system intended to find, in a large text corpus, contiguous phrases that express a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and combines a semantic word n-gram model, a statistical language model, and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different order. As data, we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of phrases extracted for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that distributional semantic models trained with uni-, bi-, and trigrams seem to work better than a more traditional unigram model.
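
To make the central idea more concrete, the sketch below shows one way n-grams of different order could be compared through co-occurrence statistics. It is not the authors' implementation. The toy corpus, the context window of two tokens, and the helper names (`ngrams`, `build_vectors`, `cosine`, `most_similar`) are all assumptions made for illustration, and raw co-occurrence counts with cosine similarity stand in for the trained n-gram embeddings the abstract describes.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, max_n=3):
    """Yield (start, end, phrase) for each contiguous n-gram, 1 <= n <= max_n."""
    for n in range(1, max_n + 1):
        for start in range(len(tokens) - n + 1):
            yield start, start + n, " ".join(tokens[start:start + n])


def build_vectors(sentences, max_n=3, window=2):
    """Count, for every n-gram, the unigrams within `window` tokens on either side."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for start, end, phrase in ngrams(tokens, max_n):
            left = tokens[max(0, start - window):start]
            right = tokens[end:end + window]
            vectors[phrase].update(left + right)
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, vectors, top_k=3):
    """Rank all other n-grams, of any order, by similarity to `query`."""
    query_vec = vectors.get(query)
    if query_vec is None:
        return []
    scored = [(p, cosine(query_vec, v)) for p, v in vectors.items() if p != query]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Hypothetical three-sentence corpus; the paper uses PubMed abstracts.
    corpus = [
        "the patient reported severe chest pain after exercise",
        "the patient reported severe thoracic discomfort after exercise",
        "the patient reported mild headache after lunch",
    ]
    vectors = build_vectors(corpus)
    for phrase, score in most_similar("chest pain", vectors):
        print(f"{score:.2f}  {phrase}")
```

On this toy corpus the top-ranked candidate for `chest pain` is `thoracic discomfort`, because the two bigrams occur in identical contexts. Keeping unigrams, bigrams, and trigrams in one vector space is what allows a query of one length to retrieve a phrase of another; a unigram-only model would need a separate composition step to compare multi-word phrases.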

Bibliographic record

Source
Nordic conference of computational Linguistics | 2019 | xx 410 p. | 9 pages
Authors
Hans Moen; Laura-Maria Peltonen; Henry Suhonen; Hanna-Maria Matinolli; Riitta Mieronkoski; Kirsi Telen; Kirsi Terho; Tapio Salakoski; Sanna Salanterä
Original format: PDF
Chinese Library Classification: Programming, software engineering
Date added to database: 2022-08-20 20:19:24
