

An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora




Abstract

We present our work towards developing a system that should find, in a large text corpus, contiguous phrases that express a meaning similar to that of a query phrase of arbitrary length. Depending on the use case, this task can be seen as a form of (phrase-level) query rewriting. The suggested approach works in a generative manner, is unsupervised, and uses a combination of a semantic word n-gram model, a statistical language model and a document search engine. A central component is a distributional semantic model containing word n-gram vectors (or embeddings), which models semantic similarities between n-grams of different order. As data we use a large corpus of PubMed abstracts. The presented experiment is based on manual evaluation of extracted phrases for arbitrary queries provided by a group of evaluators. The results indicate that the proposed approach is promising and that the use of distributional semantic models trained with uni-, bi- and trigrams seems to work better than a more traditional unigram model.
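
The idea of comparing n-grams of different order through co-occurrence statistics can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract describes trained n-gram embeddings combined with a language model and a search engine, whereas the sketch below assumes plain count-based context vectors and cosine similarity. The function names (`ngrams`, `cooccurrence_vectors`, `cosine`, `most_similar`), the window size, and the toy sentences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict
from math import sqrt


def ngrams(tokens, max_n=3):
    """Yield (start, end, text) for every n-gram of order 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])


def cooccurrence_vectors(sentences, max_n=3, window=2):
    """Count the words seen within `window` tokens on either side of each n-gram."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for start, end, gram in ngrams(tokens, max_n):
            context = tokens[max(0, start - window):start] + tokens[end:end + window]
            vectors[gram].update(context)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query, vectors, top_k=3):
    """Rank all other n-grams, of any order, by similarity to the query n-gram."""
    query_vector = vectors[query]
    scored = [(gram, cosine(query_vector, vec))
              for gram, vec in vectors.items() if gram != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy corpus (invented for this sketch, not taken from PubMed).
corpus = [
    "patients with heart attack were admitted",
    "patients with infarction were admitted",
    "patients with fever were discharged",
]
vectors = cooccurrence_vectors(corpus)

# A bigram query can be matched against unigrams, bigrams and trigrams alike.
for gram, score in most_similar("heart attack", vectors):
    print(f"{gram}\t{score:.3f}")
```

Because every n-gram is represented in the same context-word space, a bigram such as "heart attack" can be compared directly with a unigram such as "infarction", which is the property the abstract attributes to models trained with uni-, bi- and trigrams.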


Bibliographic details

Source
Nordic conference of computational Linguistics | 2019 | pp. 131-139 | 9 pages
Conference location: Turku (FI)
Authors
Hans Moen; Laura-Maria Peltonen; Henry Suhonen; Hanna-Maria Matinolli; Riitta Mieronkoski; Kirsi Telen; Kirsi Terho; Tapio Salakoski; Sanna Salanterä;
Author affiliations

Turku NLP Group, Department of Future Technologies, University of Turku, Finland;

Department of Nursing Science, University of Turku, Finland;

Department of Nursing Science, University of Turku, Finland; Turku University Hospital, Finland;

Original format: PDF
Added to database: 2022-08-26 14:42:09

Similar literature


1. Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability [J]. Luke Miratrix, Robin Ackerman. Statistical Analysis and Data Mining. 2016, no. 6
2. Using large clinical corpora for query expansion in text-based cohort identification [J]. Dongqing Zhu, Stephen Wu, Ben Carterette. Journal of biomedical informatics. 2014
3. An Unsupervised Query Rewriting Approach Using N-gram Co-occurrence Statistics to Find Similar Phrases in Large Text Corpora [C]. Hans Moen, Laura-Maria Peltonen, Henry Suhonen. Nordic conference of computational Linguistics. 2019
4. Language-independent text learning with statistical n-gram language models [D]. Peng, Fuchun. 2003
5. Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach [O]. Xiang Ren, Ahmed El-Kishky, Chi Wang
6. A New Method of N-gram Statistics for Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese [O]. Makoto Nagao, Shinsuke Mori. 1994
7. Preliminary Statistical Investigation into the Impact of an N-Gram Analysis Approach Based on Word Syntactic Categories Toward Text Author Classification [R]. Diab, M., Schuster, J., Bock, P. 2000
