International Conference on Language Resources and Evaluation
Handling Entity Normalization with no Annotated Corpus: Weakly Supervised Methods Based on Distributional Representation and Ontological Information

Abstract

Entity normalization (or entity linking) is an important subtask of information extraction that links entity mentions in text to categories or concepts in a reference vocabulary. Machine-learning-based normalization methods adapt well as long as each reference category or concept has enough training data of sufficient quality. Distributional representations are commonly used because they can handle different expressions with similar meanings. However, in specific technical and scientific domains, the small amount of training data and the relatively small size of specialized corpora remain major challenges. Recently, the machine-learning-based CONTES method has addressed these challenges for reference vocabularies that are ontologies, as is often the case in the life sciences and biomedical domains. Its performance depends on a manually annotated corpus. Furthermore, as with other machine-learning-based methods, parametrization remains tricky. We propose a new approach to address the scarcity of training data: it extends the CONTES method with corpus selection, pre-processing, and weak supervision strategies, and can yield high-performance results without any manually annotated examples. We also study which hyperparameters are most influential, and sometimes find patterns that differ from previous work. The results show that our approach significantly improves accuracy and outperforms previous state-of-the-art algorithms.
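
To make the task concrete, the following is a minimal sketch of normalizing a mention to a concept using distributional representations and no annotated examples. It is not the CONTES method or the approach proposed in the paper; it is a generic nearest-label baseline. The word vectors, concept identifiers, and function names (`embed`, `cosine`, `normalize`) are all hypothetical and chosen only for illustration.

```python
import math

# Toy word vectors (hypothetical values, for illustration only).
# In practice these would be trained on a domain corpus.
WORD_VECTORS = {
    "bacterium": [0.90, 0.10, 0.00],
    "microbe":   [0.80, 0.20, 0.10],
    "soil":      [0.10, 0.90, 0.00],
    "ground":    [0.15, 0.85, 0.05],
    "milk":      [0.00, 0.10, 0.90],
    "dairy":     [0.05, 0.15, 0.85],
}

# Hypothetical reference vocabulary: concept identifier -> label.
# The labels themselves are the only supervision used here.
CONCEPT_LABELS = {
    "C:001": "bacterium",
    "C:002": "soil",
    "C:003": "milk",
}


def embed(text):
    """Average the vectors of the known tokens; return None if none are known."""
    vectors = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize(mention):
    """Return the concept whose label embedding is closest to the mention."""
    mention_vec = embed(mention)
    if mention_vec is None:
        return None  # no known token, so no prediction is made
    return max(
        CONCEPT_LABELS,
        key=lambda cid: cosine(mention_vec, embed(CONCEPT_LABELS[cid])),
    )
```

With these toy vectors, `normalize("microbe")` selects the concept labelled "bacterium" even though the two strings share no characters, which is the property of distributional representations that the abstract refers to.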