Journal: Computer Speech and Language

A time-sensitive historical thesaurus-based semantic tagger for deep semantic annotation

Abstract

Automatic extraction and analysis of meaning-related information from natural language data has been an important issue in a number of research areas, such as natural language processing (NLP), text mining, corpus linguistics, and data science. An important aspect of such information extraction and analysis is the semantic annotation of language data using a semantic tagger. In practice, various semantic annotation tools have been designed to carry out different levels of semantic annotation, such as topics of documents, semantic role labeling, named entities or events. Currently, the majority of existing semantic annotation tools identify and tag partial core semantic information in language data, but they tend to be applicable only for modern language corpora. While such semantic analyzers have proven useful for various purposes, a semantic annotation tool that is capable of annotating deep semantic senses of all lexical units, or all-words tagging, is still desirable for a deep, comprehensive semantic analysis of language data. With large-scale digitization efforts underway, delivering historical corpora with texts dating from the last 400 years, a particularly challenging aspect is the need to adapt the annotation in the face of significant word meaning change over time. In this paper, we report on the development of a new semantic tagger (the Historical Thesaurus Semantic Tagger), and discuss challenging issues we faced in this work. This new semantic tagger is built on existing NLP tools and incorporates a large-scale historical English thesaurus linked to the Oxford English Dictionary. Employing contextual disambiguation algorithms, this tool is capable of annotating lexical units with a historically-valid highly fine-grained semantic categorization scheme that contains about 225,000 semantic concepts and 4,033 thematic semantic categories. 
In terms of novelty, it is adapted for processing historical English data, with rich information about historical usage of words and a spelling variant normalizer for historical forms of English. Furthermore, it is able to make use of knowledge about the publication date of a text to adapt its output. In our evaluation, the system achieved encouraging accuracies ranging from 77.12% to 91.08% on individual test texts. Applying time-sensitive methods improved results by as much as 3.54% and by 1.72% on average.
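
The abstract does not spell out how the publication date is used, so the following is only a minimal sketch of one plausible reading: candidate senses carry an attestation range, and senses not attested at the text's publication year are filtered out before disambiguation. The `Sense` class, the `senses_valid_at` function, the category labels, and all dates are hypothetical illustrations, not the tagger's actual data model or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sense:
    """A candidate sense with a hypothetical attestation range."""
    category: str              # illustrative label, not a real thesaurus code
    first_year: int            # earliest attested use (made-up value)
    last_year: Optional[int]   # None means the sense is still in use


def senses_valid_at(candidates: List[Sense], pub_year: int) -> List[Sense]:
    """Keep only senses whose attestation range covers the publication year."""
    valid = [
        s for s in candidates
        if s.first_year <= pub_year
        and (s.last_year is None or pub_year <= s.last_year)
    ]
    # Assumption: if the date filter removes every candidate, fall back to
    # the full list so that the word is never left without a tag.
    return valid or list(candidates)


# Toy candidates with invented dates, for illustration only.
CANDIDATES = [
    Sense("sense_a_obsolete", 1400, 1700),
    Sense("sense_b_current", 1650, None),
    Sense("sense_c_modern", 1950, None),
]

if __name__ == "__main__":
    for year in (1600, 1680, 2000):
        kept = [s.category for s in senses_valid_at(CANDIDATES, year)]
        print(year, kept)
```

In this sketch the date filter only narrows the candidate set; a contextual disambiguation step, as mentioned in the abstract, would still be needed to choose among the senses that remain.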
