International Computer Engineering Conference

Record linkage approaches in big data: A state of art study

Abstract

Record Linkage aims to find records in a dataset that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties arose, mainly related to the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major findings. First, the techniques used to address the Volume property mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning with a blocking key may produce unbalanced blocks, a problem known as data skew, in which data is not evenly distributed among blocks. Such an uneven distribution degrades the overall performance of the MapReduce execution. Fourth, to the best of our knowledge, few studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence more work should be dedicated to balancing the load between the distributed blocks.
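The prefix-concatenation blocking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the attribute names, prefix length, and sample records are all hypothetical. It also shows how such a key can produce uneven block sizes (data skew):

```python
# Sketch of prefix-based blocking for record linkage (hypothetical data).
# A blocking key is formed by concatenating the first characters of chosen
# attributes; candidate pairs are then compared only within a block.
from collections import defaultdict

def blocking_key(record, attrs=("surname", "city"), prefix_len=3):
    """Concatenate the first `prefix_len` characters of each chosen attribute."""
    return "".join(record[a][:prefix_len].lower() for a in attrs)

records = [
    {"surname": "Smith",    "city": "London"},
    {"surname": "Smithe",   "city": "London"},  # possible duplicate of above
    {"surname": "Smithson", "city": "London"},  # different entity, same block
    {"surname": "Jones",    "city": "Leeds"},
]

# Partition the dataset into blocks keyed by the blocking key.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

for key, members in blocks.items():
    print(key, len(members))  # block sizes are uneven: 3 vs 1 (data skew)
```

Because all "Smi…"/"Lon…" records collide in one block while "Jones" sits alone, the executor handling the large block does most of the pairwise comparisons, which is exactly the skew problem the abstract says MapReduce-based approaches must mitigate.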
