Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information


Abstract

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine-learning-based normalization methods adapt well as long as they have enough training data of sufficient quality for each reference entry. Distributional representations are commonly used because they can handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine-learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in the life sciences and biomedical domains. Its performance, however, depends on a manually annotated corpus. Furthermore, as with other machine-learning-based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data that extends the CONTES method with corpus selection, pre-processing and weak supervision strategies, and which can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, and sometimes observe different patterns from those reported in previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.
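
As a rough illustration of the general idea in the abstract (normalizing mentions with distributional representations while using only ontology labels as training signal), a minimal sketch might look like the following. This is not the CONTES method or the authors' implementation: the toy word vectors, the concept identifiers, and the helper names `embed`, `build_concept_vectors` and `normalize` are all hypothetical, and a plain nearest-centroid rule under cosine similarity stands in for the learned model.

```python
import math

# Toy word vectors (hypothetical values). A real system would load
# embeddings trained on a domain-specific corpus.
WORD_VECTORS = {
    "bacterium": [0.90, 0.10, 0.00],
    "bacteria":  [0.85, 0.15, 0.00],
    "microbe":   [0.80, 0.20, 0.10],
    "soil":      [0.10, 0.90, 0.00],
    "ground":    [0.15, 0.85, 0.10],
    "earth":     [0.20, 0.80, 0.10],
    "milk":      [0.00, 0.10, 0.90],
    "dairy":     [0.10, 0.10, 0.85],
}

# Hypothetical ontology fragment: concept id -> labels and synonyms.
# These labels are the only "training examples" (weak supervision);
# no manually annotated mentions are used.
ONTOLOGY_LABELS = {
    "C:bacterium": ["bacterium", "microbe"],
    "C:soil": ["soil", "earth"],
    "C:milk": ["milk"],
}


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embed(text):
    """Average the vectors of known tokens; return None if none are known."""
    known = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    return mean_vector(known) if known else None


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_concept_vectors(ontology_labels):
    """Represent each concept by the centroid of its embedded labels."""
    concept_vectors = {}
    for concept_id, labels in ontology_labels.items():
        embedded = [v for v in (embed(label) for label in labels) if v is not None]
        if embedded:
            concept_vectors[concept_id] = mean_vector(embedded)
    return concept_vectors


def normalize(mention, concept_vectors):
    """Return the concept whose centroid is most similar to the mention."""
    vector = embed(mention)
    if vector is None:
        return None
    return max(concept_vectors, key=lambda c: cosine(vector, concept_vectors[c]))


concept_vectors = build_concept_vectors(ONTOLOGY_LABELS)
print(normalize("bacteria", concept_vectors))
print(normalize("ground", concept_vectors))
```

The point of the sketch is only that mentions never seen among the labels ("bacteria", "ground") can still be mapped to a concept because their vectors lie close to those of the labels; how the paper actually selects corpora, pre-processes text and trains its model is not described in this record.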

Bibliographic details

Source: International Conference on Language Resources and Evaluation, 2020, pp. 1959-1966 (8 pages)
Authors: Arnaud Ferre; Robert Bossy; Mouhamadou Ba; Louise Deleger; Thomas Lavergne; Pierre Zweigenbaum; Claire Nedellec
Original format: PDF
Keywords: Information Extraction; entity normalization; entity linking; supervised learning

Similar literature

1. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine [J]. Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión. BMC Medical Informatics and Decision Making. 2021, Issue 1
2. A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annotated Text corpus (MERLOT) [J]. Campillos Leonardo, Deleger Louise, Grouin Cyril. Language Resources and Evaluation. 2018, Issue 2
3. Optimisation of the Largest Annotated Tibetan Corpus Combining Rule-based, Memory-based, and Deep-learning Methods [J]. Meelen Marieke, Roux Elie, Hill Nathan. ACM Transactions on Asian and Low-Resource Language Information Processing. 2021, Issue 1
4. Tools and methodologies for annotating syntax and named entities in the National Corpus of Polish [C]. Proceedings of the International Multiconference on Computer Science and Information Technology. 2010
5. Entity Analysis with Weak Supervision: Typing, Linking, and Attribute Extraction [D]. Ling, Xiao. 2015
6. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine [O]. Leonardo Campillos-Llanos, Ana Valverde-Mateos, Adrián Capllonch-Carrión. 2021
7. NERO: A Biomedical Named-entity (Recognition) Ontology with a Large, Annotated Corpus Reveals Meaningful Associations Through Text Embedding [O]. Kanix Wang, Robert Stevens, Halima Alachram. 2020