International Computer Engineering Conference

A New Automated Big Data Partitioning Approach to Improve Condensation Methods Performance


Abstract

The enormous amount of structured and unstructured data produced in many fields has led to the era of big data. The volume of these data makes existing mining algorithms ineffective. Data reduction techniques are therefore mainly applied before data mining algorithms. Instance selection is one of the promising reduction techniques: it reduces the size of the training dataset by selecting the most relevant instances. However, traditional instance selection methods scale poorly because of memory limitations. Recent approaches partition the training dataset into subsets and apply instance selection methods to the individual subsets. Most of these approaches rely on random partitioning, which negatively affects the performance of the instance selection methods, especially when the number of subsets is high. In this work, we propose a new partitioning approach called automated overlapped distance-based partitioning. Our approach assigns instances to subsets according to distance, and an instance can be assigned to two subsets based on a defined threshold. We implement and experimentally test the proposed approach using six standard datasets and the CNN method, a standard instance-selection condensation method. The results demonstrate that our approach outperforms current random partitioning approaches in terms of reduction rate and effectiveness. Moreover, our approach maintains a high reduction rate and effectiveness as the number of subsets increases.
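The abstract does not give the partitioning rule in detail, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes that each subset is represented by a given center, that an instance goes to its nearest center, and that it is also copied to the second-nearest subset when the two distances differ by at most the threshold. It also assumes that CNN denotes Hart's condensed nearest neighbour rule. The names `overlapped_partition`, `cnn_condense`, `centers`, and `threshold` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def overlapped_partition(points, centers, threshold):
    """Assumed rule: put each point in its nearest center's subset, and also in
    the second-nearest subset when the distance gap is at most `threshold`."""
    subsets = [[] for _ in centers]
    for i, p in enumerate(points):
        ranked = sorted(range(len(centers)), key=lambda c: euclidean(p, centers[c]))
        first = ranked[0]
        subsets[first].append(i)
        if len(ranked) > 1:
            second = ranked[1]
            gap = euclidean(p, centers[second]) - euclidean(p, centers[first])
            if gap <= threshold:
                subsets[second].append(i)  # overlap: point belongs to two subsets
    return subsets


def cnn_condense(points, labels, indices):
    """Condensed nearest neighbour (assumed meaning of CNN) on one subset:
    keep an instance only if the current store misclassifies it with 1-NN."""
    if not indices:
        return []
    store = [indices[0]]
    changed = True
    while changed:
        changed = False
        for i in indices:
            if i in store:
                continue
            nearest = min(store, key=lambda s: euclidean(points[i], points[s]))
            if labels[nearest] != labels[i]:
                store.append(i)
                changed = True
    return store


# Toy data (made up for illustration): two clusters plus one borderline point.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1),
          (1.0, 1.0), (1.1, 0.9), (0.9, 1.0),
          (0.5, 0.5)]
labels = [0, 0, 0, 1, 1, 1, 0]
centers = [(0.0, 0.0), (1.0, 1.0)]  # hypothetical subset centers

subsets = overlapped_partition(points, centers, threshold=0.1)
# Condense each subset independently, then merge the selected indices.
selected = sorted({i for s in subsets for i in cnn_condense(points, labels, s)})
print(subsets, selected)
```

In this toy run the borderline point is equidistant from both centers, so it lands in both subsets. The overlap is meant to let each subset see instances near its boundary, which random partitioning would scatter arbitrarily.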
