Conference: International Conference on Computational Linguistics

What Does This Acronym Mean? Introducing a New Dataset for Acronym Identification and Disambiguation

Abstract

Acronyms are short forms of phrases that make it easier to convey lengthy sentences in documents, and they serve as one of the mainstays of writing. Because of their importance, identifying acronyms and their corresponding phrases (i.e., acronym identification (AI)) and finding the correct meaning of each acronym (i.e., acronym disambiguation (AD)) are crucial for text understanding. Despite recent progress on these tasks, limitations in the existing datasets hinder further improvement. More specifically, the limited size of manually annotated AI datasets or the noise in automatically created AI datasets obstructs the design of advanced, high-performing acronym identification models. Moreover, the existing datasets are mostly limited to the medical domain and ignore other domains. To address these two limitations, we first create a large, manually annotated AI dataset for the scientific domain. This dataset contains 17,506 sentences, which is substantially larger than previous scientific AI datasets. Next, we prepare an AD dataset for the scientific domain with 62,441 samples, which is significantly larger than the previous scientific AD dataset. Our experiments show that the existing state-of-the-art models fall far behind human-level performance on both datasets proposed in this work. In addition, we propose a new deep learning model that uses the syntactic structure of the sentence to expand an ambiguous acronym in that sentence. The proposed model outperforms the state-of-the-art models on the new AD dataset, providing a strong baseline for future research on this dataset.