Journal: Journal of intelligent & fuzzy systems: Applications in Engineering and Technology

Entity resolution framework using rough set blocking for heterogeneous web of data


Abstract

Entity Resolution (ER) is the process of resolving similar entities during data cleaning and data integration. However, existing ER frameworks lead to exhaustive pairwise comparisons. Blocking, the most efficient ER method, still inherently uses exponential pairwise comparisons on large databases, which leads to poor efficiency in resolving the entities. Real-world data can be either homogeneous or heterogeneous, and ER generally takes one of two forms: clean-clean ER, in which the datasets contain no duplicates, or dirty ER, in which duplicates exist within the dataset. An ER framework has two phases: the block building phase, which constructs blocks by grouping similar entities into a single block for effective indexing, and the block processing phase, which aims to reduce the number of redundant pairwise comparisons. Another concern is the handling of entities associated with heterogeneous data. In the proposed work, the block building phase aims to gather related entities with different representations into a single block with an approximation space. For this purpose, a semantic-dominance rough set is used to cluster the attributes of related entities that have varied schemas. The similarity between the entities associated with the clustered attributes is determined using a rough-Jaccard similarity measure, and the entities are grouped to form blocks of varied but limited size. Pairwise comparisons between the blocks of entities are carried out only when the lower approximations of the blocks are the same, as determined by the proposed multi-criteria Pareto optimality; otherwise the entities are not compared, which reduces the overall number of pairwise comparisons.
The performance of the proposed technique has been tested on four real-world, highly heterogeneous datasets, and validation of the algorithms yielded 99.98% effectiveness and 98.3% efficiency in block comparison when compared to token blocking and attribute clustering methods.
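
To make the building blocks named in the abstract concrete, here is a minimal Python sketch. It is not the authors' method. It assumes plain Jaccard similarity in place of the paper's rough-Jaccard measure, classical Pawlak lower and upper approximations in place of the semantic-dominance rough set, and ordinary token blocking (one of the baselines the abstract mentions) in place of the proposed block building. The toy records, the 0.5 threshold, and all function names are hypothetical. For n records, brute-force ER compares n(n-1)/2 pairs, and blocking restricts comparison to pairs that share a block.

```python
from itertools import combinations


def jaccard(a, b):
    """Plain Jaccard similarity of two token sets (a stand-in, not rough-Jaccard)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def approximations(classes, target):
    """Classical Pawlak lower and upper approximations of `target`.

    `classes` is a partition of the universe into equivalence classes.
    """
    lower, upper = set(), set()
    for cls in classes:
        if cls <= target:   # class lies entirely inside the target
            lower |= cls
        if cls & target:    # class overlaps the target
            upper |= cls
    return lower, upper


def token_blocks(records):
    """Token blocking: one block per token, holding every record that contains it."""
    blocks = {}
    for rid, tokens in records.items():
        for tok in tokens:
            blocks.setdefault(tok, set()).add(rid)
    # A block with a single record yields no comparison, so drop it.
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Distinct record pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records: id -> set of tokens taken from attribute values.
records = {
    "r1": {"john", "smith", "london"},
    "r2": {"j", "smith", "london"},
    "r3": {"maria", "garcia", "madrid"},
    "r4": {"maria", "garcia", "spain"},
}

all_pairs = set(combinations(sorted(records), 2))       # brute force: n(n-1)/2 pairs
blocked_pairs = candidate_pairs(token_blocks(records))  # only pairs sharing a block
matches = {
    (x, y) for x, y in blocked_pairs
    if jaccard(records[x], records[y]) >= 0.5           # assumed threshold
}

print(len(all_pairs), len(blocked_pairs), sorted(matches))
print(approximations([{1, 2}, {3, 4}, {5}], {1, 2, 3}))
```

On this toy input, blocking cuts the six brute-force pairs down to two candidate pairs. The last call shows a target set whose lower approximation is smaller than its upper approximation, which is the kind of approximation space the abstract refers to.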
