Journal of Parallel and Distributed Computing

Estimating record linkage costs in distributed environments

Abstract

Record Linkage (RL) is the task of identifying duplicate entities within a dataset or across multiple datasets. In the era of Big Data, this task has gained considerable attention because the problem is intrinsically quadratic in the size of the dataset. In practice, the task can be outsourced to a cloud service, so a service customer may want to estimate the cost of a record linkage solution before executing it. Because the execution time of a record linkage solution depends on the combination of algorithms used, their parameter values, and the cloud infrastructure employed, it is hard to estimate a priori the infrastructure cost of executing a record linkage task. Besides estimating customer costs, such an estimate is also important for evaluating whether a given set of RL parameter values will satisfy predefined time and budget restrictions. To tackle these challenges, we propose a theoretical model for estimating RL costs that takes into account the main steps that may influence the execution time of the RL task. We also propose an algorithm, denoted TBF, for evaluating the feasibility of RL parameter values given a set of predefined customer restrictions. We evaluate the efficacy of the proposed model, combined with regression techniques, using record linkage results processed in real distributed environments. The experimental results show that the regression technique employed has a significant influence on the estimated record linkage costs. Moreover, we conclude that specific regression techniques are more suitable than others for estimating record linkage costs, depending on the evaluated scenario.
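
To make the quadratic growth and the idea of a feasibility check concrete, the following is a minimal sketch under stated assumptions. It is not the paper's cost model and not the TBF algorithm, neither of which is described in the abstract. It assumes a naive all-pairs comparison with no blocking, perfectly even parallelism across nodes, and a flat per-node hourly price. All names (`candidate_pairs`, `naive_estimate`, `is_feasible`) and parameters are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


def candidate_pairs(n_records: int) -> int:
    """Pairwise comparisons needed to deduplicate n records with no blocking."""
    return n_records * (n_records - 1) // 2


@dataclass
class Estimate:
    hours: float  # estimated wall-clock time
    cost: float   # estimated infrastructure cost


def naive_estimate(n_records: int, secs_per_comparison: float,
                   n_nodes: int, price_per_node_hour: float) -> Estimate:
    """Assumption: comparisons split evenly across nodes, no overhead."""
    pairs = candidate_pairs(n_records)
    hours = pairs * secs_per_comparison / n_nodes / 3600
    cost = hours * n_nodes * price_per_node_hour
    return Estimate(hours=hours, cost=cost)


def is_feasible(est: Estimate, max_hours: float, max_cost: float) -> bool:
    """True when the estimate respects both the time and the budget limit."""
    return est.hours <= max_hours and est.cost <= max_cost


if __name__ == "__main__":
    print(candidate_pairs(1_000))   # 499500
    print(candidate_pairs(10_000))  # 49995000, about 100x for 10x the records
    est = naive_estimate(10_000, secs_per_comparison=1e-4,
                         n_nodes=4, price_per_node_hour=0.5)
    print(is_feasible(est, max_hours=1.0, max_cost=5.0))
```

In this toy setting, adding nodes shortens the estimated time but leaves the estimated cost unchanged, which is one reason a real estimate needs measured data and a fitted model, as the abstract argues.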
