
Semi-random partitioning of data into training and test sets in granular computing context



Abstract

Due to the vast and rapid increase in the size of data, machine learning has become an increasingly popular approach for knowledge discovery and predictive modelling. For both purposes, it is essential to partition a data set into a training set and a test set: the training set is used to learn a model, and the test set is then used to evaluate the performance of the model learned from the training set. However, the influence of this split on model performance has only been investigated with respect to the optimal proportion for the two sets, with no attention paid to the characteristics of the data within the training and test sets. The current practice is thus to randomly split the data into approximately 70% for training and 30% for testing. In this paper, we show that this way of partitioning the data leads to two major issues: (a) class imbalance and (b) sample representativeness. Class imbalance is known to affect the performance of many classifiers by introducing a bias towards the majority class; the representativeness of the training set affects a model's performance by depriving the algorithm of the opportunity to learn from relevant examples, similar to testing a student on material that was not taught. To address these two issues, we propose a semi-random data partitioning framework in the setting of granular computing. While we discuss how the framework can address both issues, in this paper we focus on avoiding class imbalance when partitioning the data through the proposed approach. The results show that avoiding class imbalance leads to better model performance.
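As a rough illustration of the class-balance idea the abstract describes (not the authors' implementation, which is defined within their granular computing framework), one can shuffle randomly within each class while fixing the train/test proportion per class, so the class distribution is preserved in both partitions. The sketch below is a minimal Python assumption of this scheme; the helper name semi_random_split and the toy data are hypothetical, and the 70/30 proportion is the current practice cited in the abstract.

import numpy as np

def semi_random_split(X, y, train_frac=0.7, seed=0):
    # Shuffle indices within each class (treating each class as a
    # "granule") and cut every class at the same train fraction, so
    # the class distribution is preserved in both partitions.
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)   # indices of this class
        rng.shuffle(idx)                 # random *within* the class
        cut = int(round(train_frac * len(idx)))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return np.array(train_idx), np.array(test_idx)

# Toy usage on an imbalanced two-class data set (90 vs. 10 examples).
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)
tr, te = semi_random_split(X, y)
print(np.bincount(y[tr]))  # [63  7] -- 9:1 class ratio kept in training
print(np.bincount(y[te]))  # [27  3] -- and in testing

On an imbalanced toy set, both partitions keep the original 9:1 class ratio, which is the property the abstract argues avoids the class-imbalance issue of a purely random split.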

Bibliographic details

  • Authors

    Liu Han; Cocea Mihaela;

  • Author affiliation
  • Year: 2017
  • Total pages
  • Original format: PDF
  • Language: en
  • CLC classification

