Conference: International Conference on Cloud Computing and Security

An Improved Data Cleaning Algorithm Based on SNM

Abstract

The basic sorted-neighborhood method (SNM) is a classic algorithm for detecting approximately duplicate records in data cleaning. Its drawbacks are that the size of the sliding window is hard to select and that attribute matching is performed too frequently, so detection efficiency is poor. This paper proposes an optimized algorithm based on SNM. It makes the size and speed of the sliding window variable, which avoids missing record comparisons and reduces unnecessary ones. It also uses cosine similarity in attribute matching to improve detection precision, and it proposes a Top-k effective weight filtering algorithm to reduce the number of attribute matches and improve detection efficiency. Experimental results show that the improved algorithm outperforms SNM in recall, precision, and execution time.
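The abstract gives no pseudocode, so the following is only a minimal sketch of how the three ideas might fit together, not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the character-bigram vectors used for cosine similarity, the rule that grows the window when a match lands on its edge, the threshold, and the attribute weights. The variable window speed mentioned in the abstract is not modeled.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of character-bigram count vectors (assumed representation)."""
    def bigrams(s: str) -> Counter:
        s = s.lower().strip()
        return Counter(s[i:i + 2] for i in range(len(s) - 1)) if len(s) > 1 else Counter([s])

    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def record_similarity(r1: dict, r2: dict, weights: dict, k: int) -> float:
    """Compare only the k highest-weighted attributes (hypothetical Top-k filter)."""
    top = sorted(weights, key=weights.get, reverse=True)[:k]
    total = sum(weights[a] for a in top)
    return sum(weights[a] * cosine_similarity(r1[a], r2[a]) for a in top) / total


def find_duplicates(records, key, weights, k=2, threshold=0.8, w_min=2, w_max=5):
    """Sorted-neighborhood pass with a window that grows from w_min up to w_max."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        window = w_min  # a window of size w covers the record and its w - 1 successors
        offset = 1
        while offset < window and pos + offset < len(order):
            j = order[pos + offset]
            if record_similarity(records[i], records[j], weights, k) >= threshold:
                pairs.add((min(i, j), max(i, j)))
                # Assumed growth rule: a match on the window edge hints that
                # further duplicates may follow, so widen the window by one.
                if offset == window - 1 and window < w_max:
                    window += 1
            offset += 1
    return pairs


# Made-up records and weights, purely for illustration.
records = [
    {"name": "John Smith", "city": "New York", "note": "a"},
    {"name": "Jon Smith", "city": "New York", "note": "b"},
    {"name": "Mary Jones", "city": "Boston", "note": "c"},
]
weights = {"name": 0.6, "city": 0.3, "note": 0.1}
duplicates = find_duplicates(records, key=lambda r: r["name"], weights=weights)
print(duplicates)
```

With `k=2` the low-weight `note` attribute is never compared, which is the kind of saving the Top-k filter is meant to provide. Keeping the window small unless matches keep appearing limits comparisons in sparse regions of the sorted list while still reaching longer runs of duplicates.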
