Conference: Workshop of the European Group for Intelligent Computing in Engineering

Unsupervised Named Entity Normalization for Supporting Information Fusion for Big Bridge Data Analytics

Abstract

The large amount of multi-type, multi-source bridge data opens unprecedented opportunities for big data analytics to improve bridge deterioration prediction. Before the analytics, information fusion is needed to transform the heterogeneous data from different sources into a unified representation. Resolving the ambiguities in the named entities extracted from bridge inspection reports is one of the most important fusion tasks. The ambiguity stems from the use of different, ambiguous surface forms for the same target named entity. There is thus a need for named entity normalization (NEN) methods that can map these surface forms to their canonical form, an identifier concept. However, existing NEN methods are limited in this regard: most require pre-established knowledge (e.g., dictionaries or Wikipedia) and/or training data, and most ignore the impact of the normalization on data analytics. To address this need, this paper proposes an unsupervised NEN method. It has two main components: candidate identifier concept generation based on multi-grams of each named entity set, and candidate identifier concept ranking based on a proposed ranking function. The function uses the TF-IDF (term frequency-inverse document frequency) weight and is further improved by considering the impact of gram length and position on the ranking. It aims to balance the abstractness and detailedness of the identifier concepts, so that the resulting data are neither too dense nor too sparse for the analytics. A set of experiments was conducted to evaluate the performance of the proposed method, which achieved an accuracy of 84.5%.
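
The two components described in the abstract, candidate generation from multi-grams and TF-IDF-based ranking adjusted for gram length and position, can be illustrated with a minimal sketch. The abstract does not give the actual ranking function, so everything below is an assumption made for illustration only: the function names `candidate_grams` and `rank_identifier_concepts`, the smoothed IDF formula, the multiplicative length and position factors, the weights `length_weight` and `position_weight`, the choice to favor grams that end near the end of a surface form, and the toy entity sets. It is not the paper's method.

```python
import math
from collections import Counter, defaultdict


def candidate_grams(surface_form, max_n=3):
    """Return (gram, length, end_ratio) for every word n-gram up to max_n."""
    tokens = surface_form.lower().split()
    grams = []
    for n in range(1, min(max_n, len(tokens)) + 1):
        for start in range(len(tokens) - n + 1):
            gram = " ".join(tokens[start:start + n])
            # end_ratio is 1.0 when the gram ends at the last token.
            end_ratio = (start + n) / len(tokens)
            grams.append((gram, n, end_ratio))
    return grams


def rank_identifier_concepts(entity_sets, target, max_n=3,
                             length_weight=0.25, position_weight=0.25):
    """Rank candidate identifier concepts for entity_sets[target].

    Each entity set is a list of surface forms assumed to name one entity.
    Entity sets play the role of "documents" for the IDF term (assumption).
    """
    # Document frequency: number of entity sets in which each gram occurs.
    doc_freq = Counter()
    for forms in entity_sets:
        seen = {g for form in forms for g, _, _ in candidate_grams(form, max_n)}
        doc_freq.update(seen)

    forms = entity_sets[target]
    term_freq = Counter()
    gram_len = {}
    best_end = defaultdict(float)
    for form in forms:
        for gram, n, end_ratio in candidate_grams(form, max_n):
            term_freq[gram] += 1
            gram_len[gram] = n
            best_end[gram] = max(best_end[gram], end_ratio)

    n_sets = len(entity_sets)
    scored = []
    for gram, tf in term_freq.items():
        # Smoothed IDF (assumed form, not taken from the paper).
        idf = math.log((1 + n_sets) / (1 + doc_freq[gram])) + 1.0
        # Assumed adjustments: longer grams and grams ending late score higher.
        length_factor = 1.0 + length_weight * (gram_len[gram] - 1)
        position_factor = 1.0 + position_weight * best_end[gram]
        score = (tf / len(forms)) * idf * length_factor * position_factor
        scored.append((gram, score))
    # Highest score first; ties are broken alphabetically.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


# Hypothetical surface forms, invented for this example.
example_sets = [
    ["bridge deck", "concrete deck", "deck", "deck slab"],
    ["steel girder", "girder", "main girder"],
    ["bridge railing", "railing"],
]
ranked = rank_identifier_concepts(example_sets, target=0)
print(ranked[:3])
```

In this sketch, raising `length_weight` pushes the ranking toward longer, more detailed grams, while lowering it favors shorter, more abstract ones, which mirrors the balance between abstractness and detailedness that the abstract describes.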