Conference: IEEE International Conference on Bioinformatics and Bioengineering

Selecting the Appropriate Data Sampling Approach for Imbalanced and High-Dimensional Bioinformatics Datasets

Abstract

One of the more prevalent problems when working with bioinformatics datasets is class imbalance, which occurs when one class has more instances than the other class(es). This problem is made worse because the class of interest is frequently also the minority class. A possible solution is data sampling, a powerful tool for combating class imbalance by adding or removing instances to make the dataset more balanced. Beyond the choice of whether to include data sampling at all, one of the most important decisions when applying it is what the final class ratio should be. Commonly, the final class ratio is 50:50; however, it is an open question whether other ratios are more appropriate for certain imbalanced datasets (all datasets in this paper have 25.16% minority instances or less), where a 50:50 ratio requires extreme modification of the dataset. In this work we compare six data sampling approaches (feature selection combined with each pairwise combination of three data sampling techniques and two final class ratios) against feature selection without data sampling, with the goal of determining whether including data sampling is beneficial and, if so, what the final class ratio should be. To test the six data sampling approaches and feature selection alone thoroughly, we utilize seven imbalanced and high-dimensional datasets, three feature selection techniques, and six classifiers. Our results show that for a majority of scenarios, random undersampling with a final class ratio of either 35:65 or 50:50 is the best data sampling approach. Statistical analysis shows that there is no significant difference between the data sampling approaches. Despite this, we still recommend using random undersampling with 35:65 as the final class ratio, because random undersampling and 35:65 were most frequently the top-performing data sampling technique and class ratio, respectively. Additionally, 35:65 will have fewer negative impacts than 50:50 (less data loss or overfitting, which makes it the better choice if all other factors are equal), and random undersampling is more computationally efficient than any other form of sampling, including "no sampling" (both by not requiring any internal calculations and by producing a reduced, easier-to-work-with dataset). To our knowledge, this is the most comprehensive work focusing on whether to include data sampling, and how to implement it with different final class ratios, on bioinformatics datasets which exhibit such large levels of class imbalance.
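To make the idea of a final class ratio concrete, the following is a minimal Python sketch of random undersampling to a target ratio. It is an illustration under stated assumptions, not the authors' implementation: the function name `random_undersample`, its parameters, and the toy labels are hypothetical, and "35:65" is assumed to mean minority:majority. The sketch keeps every minority instance and randomly discards majority instances until the minority class reaches the requested share.

```python
import random
from collections import Counter


def random_undersample(labels, minority_label, minority_share, seed=0):
    """Return sorted indices to keep so that the minority class makes up
    roughly `minority_share` of the result (0.35 for an assumed 35:65 ratio).

    Hypothetical helper for illustration: all minority instances are kept
    and majority instances are dropped uniformly at random.
    """
    if not 0 < minority_share < 1:
        raise ValueError("minority_share must be strictly between 0 and 1")
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    # Majority count that yields the requested share: n_min * (1 - s) / s.
    target_majority = round(len(minority_idx) * (1 - minority_share) / minority_share)
    # Undersampling only removes instances, so cap at what is available.
    target_majority = min(target_majority, len(majority_idx))
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    kept_majority = rng.sample(majority_idx, target_majority)
    return sorted(minority_idx + kept_majority)


# Toy labels (made up): 35 minority and 165 majority instances, 17.5% minority.
labels = ["pos"] * 35 + ["neg"] * 165

keep_35_65 = random_undersample(labels, "pos", 0.35)
keep_50_50 = random_undersample(labels, "pos", 0.50)

print(Counter(labels[i] for i in keep_35_65))
print(Counter(labels[i] for i in keep_50_50))
```

On this toy data, the 35:65 target keeps 65 of the 165 majority instances (discarding 100), while the 50:50 target keeps only 35 (discarding 130), which illustrates why the less aggressive ratio loses less data.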
