Indian Journal of Science and Technology

A Semantic Deduplication of Temporal Dynamic Records from Multiple Web Databases

Abstract

Objective: The main objective of this paper is to improve the true positive level of record deduplication using an ontology-based MHMM-Fuzzy clustering approach.

Methods/Statistical Analysis: Most record deduplication systems in the literature use genetic programming, which combines different pieces of evidence extracted from the data content. However, the accuracy of such systems is low. To overcome this problem, we propose a Multiple Hidden Markov Model (MHMM), which increases accuracy and also identifies joint duplicate records. If the database has multiple columns, however, this model performs deduplication on all of them, which degrades system performance. To solve this problem, MHMM-Fuzzy clustering based record deduplication is introduced. In this system, fuzzy clustering is performed on multiple observations from the Hidden Markov Model. Duplicate records are then grouped into one cluster according to fuzzy logic, so they can be eliminated easily. However, the true positive level of this system is still low. To improve it, fuzzy ontology based semantic similarity is incorporated into the MHMM-Fuzzy clustering approach. This raises the true positive level of the model and thus increases the efficiency of the deduplication function, which identifies replica and duplicate records.

Findings: MHMM based record deduplication, MHMM-Fuzzy clustering based record deduplication, and the ontology-based MHMM-Fuzzy clustering approach are applied to the Cora bibliographic dataset and the Restaurants dataset. Performance is evaluated in terms of precision, recall, F-measure, execution time, and accuracy.

Applications/Improvements: The proposed approach achieves better record deduplication results than previous work in terms of precision, recall, F-measure, execution time, and accuracy.
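The abstract reports precision, recall, and F-measure but does not say how they are computed. The following is a minimal sketch of one common convention for deduplication, where the metrics are taken over pairs of records placed in the same cluster. It is an assumption for illustration, not the authors' evaluation procedure; the function names and the toy record ids are hypothetical.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """Return every unordered pair of record ids that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sorting gives each pair a canonical order, so (a, b) == (b, a).
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_scores(predicted_clusters, true_clusters):
    """Pairwise precision, recall, and F-measure (assumed convention)."""
    predicted = duplicate_pairs(predicted_clusters)
    actual = duplicate_pairs(true_clusters)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical toy data: record ids grouped by real-world entity.
true_clusters = [{"r1", "r2", "r3"}, {"r4", "r5"}, {"r6"}]
predicted_clusters = [{"r1", "r2"}, {"r3"}, {"r4", "r5", "r6"}]

precision, recall, f_measure = pairwise_scores(predicted_clusters, true_clusters)
print(precision, recall, f_measure)
```

In this toy case the prediction recovers two of the four true duplicate pairs and proposes two wrong ones, so all three scores are 0.5. A higher true positive level, which is what the ontology-based step aims for, would show up here as a larger overlap between the predicted and true pair sets.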
