International Computer Engineering Conference

A New Automated Big Data Partitioning Approach to Improve Condensation Methods Performance

Abstract

The enormous volume of structured and unstructured data produced in many fields has led to the era of big data. At this scale, existing mining algorithms can no longer process the data effectively, so data reduction techniques are commonly applied before data mining. Instance selection is one of the promising reduction techniques: it reduces the size of the training dataset by selecting the most relevant instances. However, traditional instance selection methods scale poorly because of memory limitations. Recent approaches partition the training dataset into subsets and apply instance selection to each subset individually. Most of these approaches rely on random partitioning, which degrades the performance of the instance selection methods, especially when the number of subsets is high. In this work, we propose a new partitioning approach called automated overlapped distance-based partitioning. Our approach assigns instances to subsets according to distance, and an instance can be assigned to two subsets based on a defined threshold. We implement the proposed approach and evaluate it experimentally on six standard datasets with the CNN method, a standard condensation method for instance selection. The results show that our approach outperforms current random partitioning approaches in terms of reduction rate and effectiveness. Moreover, it maintains a high reduction rate and effectiveness as the number of subsets increases.
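
The abstract gives only the outline of the method, so the following Python sketch is an illustration under stated assumptions rather than the authors' implementation. It assumes that each subset is represented by a center point, that distance is Euclidean, that an instance joins its second-nearest subset when the two distances differ by at most the threshold, and that CNN refers to Hart's Condensed Nearest Neighbor rule. The names `overlapped_partition`, `cnn_select`, and `condense` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def overlapped_partition(X, centers, threshold):
    """Assign each instance index to the subset of its nearest center.

    Assumed overlap rule (not specified in the abstract): the instance is
    also added to the second-nearest subset when the two distances differ
    by at most `threshold`.
    """
    subsets = [[] for _ in centers]
    for i, x in enumerate(X):
        ranked = sorted(range(len(centers)),
                        key=lambda c: euclidean(x, centers[c]))
        subsets[ranked[0]].append(i)
        if len(ranked) > 1:
            d1 = euclidean(x, centers[ranked[0]])
            d2 = euclidean(x, centers[ranked[1]])
            if d2 - d1 <= threshold:
                subsets[ranked[1]].append(i)
    return subsets


def cnn_select(X, y, indices):
    """Condensed Nearest Neighbor over `indices`.

    Starts from the first instance and keeps adding any instance that the
    current store misclassifies under 1-NN, until a full pass adds nothing.
    """
    if not indices:
        return []
    store = [indices[0]]
    changed = True
    while changed:
        changed = False
        for i in indices:
            if i in store:
                continue
            nearest = min(store, key=lambda s: euclidean(X[i], X[s]))
            if y[nearest] != y[i]:
                store.append(i)
                changed = True
    return store


def condense(X, y, centers, threshold):
    """Run CNN on every overlapped subset and merge the kept indices."""
    selected = set()
    for subset in overlapped_partition(X, centers, threshold):
        selected.update(cnn_select(X, y, subset))
    return sorted(selected)


# Toy data: two well-separated groups plus one point midway between them.
X = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
y = [0, 0, 0, 1, 1, 1, 0]
centers = [(0, 0), (10, 10)]  # hypothetical subset centers

parts = overlapped_partition(X, centers, threshold=1.0)
kept = condense(X, y, centers, threshold=1.0)
```

In this toy run the midpoint `(5, 5)` is equidistant from both centers, so it lands in both subsets. That is the intended effect of the overlap: an instance near a partition boundary is seen by the condensation step in each neighboring subset instead of being assigned to only one of them.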