International Computer Engineering Conference

Record linkage approaches in big data: A state of art study

Abstract

Record Linkage aims to find records in a dataset that represent the same real-world entity across many different data sources. It is a crucial task for data quality. With the evolution of Big Data, new difficulties arose, mainly related to the 5Vs of Big Data: Volume, Variety, Velocity, Value, and Veracity. Record Linkage in Big Data is therefore more challenging. This paper investigates ways to apply Record Linkage algorithms that handle the Volume property of Big Data. Our investigation revealed four major findings. First, the techniques used to address the Volume property mainly depend on partitioning the data into a number of blocks, whose processing is distributed in parallel among many executors. Second, MapReduce is the best-known programming model designed for parallel processing of Big Data. Third, a blocking key is usually used to partition the big dataset into smaller blocks; it is often created by concatenating the prefixes of chosen attributes. Partitioning with a blocking key may produce unbalanced blocks, a problem known as data skew, in which data is not evenly distributed among blocks. Such an uneven distribution degrades the overall performance of the MapReduce execution. Fourth, to the best of our knowledge, few studies have so far addressed balancing the load between data blocks in a MapReduce framework. Hence more work should be dedicated to balancing the load between the distributed blocks.
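The prefix-concatenation blocking described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the attribute names, prefix length, and sample records are all hypothetical. It also shows how such a key can produce uneven block sizes (data skew):

```python
# Sketch of prefix-based blocking for record linkage (hypothetical data).
# A blocking key is formed by concatenating the first characters of chosen
# attributes; candidate pairs are then compared only within a block.
from collections import defaultdict

def blocking_key(record, attrs=("surname", "city"), prefix_len=3):
    """Concatenate the first `prefix_len` characters of each chosen attribute."""
    return "".join(record[a][:prefix_len].lower() for a in attrs)

records = [
    {"surname": "Smith",    "city": "London"},
    {"surname": "Smithe",   "city": "London"},  # possible duplicate of above
    {"surname": "Smithson", "city": "London"},  # different entity, same block
    {"surname": "Jones",    "city": "Leeds"},
]

# Partition the dataset into blocks keyed by the blocking key.
blocks = defaultdict(list)
for r in records:
    blocks[blocking_key(r)].append(r)

for key, members in blocks.items():
    print(key, len(members))  # block sizes are uneven: 3 vs 1 (data skew)
```

Because all "Smi…"/"Lon…" records collide in one block while "Jones" sits alone, the executor handling the large block does most of the pairwise comparisons, which is exactly the skew problem the abstract says MapReduce-based approaches must mitigate.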
