Conference: IEEE International Conference on Bioinformatics and Bioengineering

Selecting the Appropriate Data Sampling Approach for Imbalanced and High-Dimensional Bioinformatics Datasets


Abstract

One of the more prevalent problems when working with bioinformatics datasets is class imbalance, where one class has many more instances than the other class(es). This problem is made worse because the class of interest is frequently also the minority class. A possible solution is data sampling, a powerful tool for combating class imbalance by adding or removing instances to make the dataset more balanced. Beyond the choice of whether to use data sampling at all, one of the most important decisions is what the final class ratio should be. Commonly, the final class ratio when data sampling is applied is 50:50; however, it is an open question whether other ratios are more appropriate for certain imbalanced datasets (all datasets in this paper have 25.16% minority instances or less), where a 50:50 ratio requires extreme modification to the dataset. In this work we compare six different data sampling approaches (feature selection with the pairwise combinations of three data sampling techniques and two final class ratios) against feature selection without data sampling, with the goal of determining whether the inclusion of data sampling is beneficial and, if so, what the final class ratio should be. To test the six data sampling approaches and feature selection alone thoroughly, we utilize seven imbalanced and high-dimensional datasets, three feature selection techniques, and six classifiers. Our results show that for a majority of scenarios, random undersampling with a final class ratio of either 35:65 or 50:50 is the best data sampling approach. Statistical analysis shows that there is no significant difference between the data sampling approaches. Despite this, we still recommend using random undersampling with 35:65 as the final class ratio, because random undersampling and 35:65 were, respectively, the data sampling technique and class ratio that most frequently performed best. Additionally, 35:65 will have fewer negative impacts than 50:50 (less data loss or overfitting, which makes it a better choice if all other factors are equal), and random undersampling is more computationally efficient than any other form of sampling, including "no sampling" (both by not requiring any internal calculations and by producing a reduced, easier-to-work-with dataset). To our knowledge, this is the most comprehensive work focusing on the choice of whether and how to apply data sampling with different final class ratios on bioinformatics datasets that exhibit such large levels of class imbalance.
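
To make the idea of a final class ratio concrete, the following is a minimal sketch of random undersampling to a chosen minority:majority ratio. It is not the authors' implementation: the function name `random_undersample`, its parameters, and the toy labels are all hypothetical, and it assumes a two-class problem in which every minority instance is kept and majority instances are discarded uniformly at random.

```python
import random
from collections import Counter


def random_undersample(labels, minority_label, minority_fraction, seed=0):
    """Return sorted indices to keep so that the minority class makes up
    roughly `minority_fraction` of the retained instances.

    Hypothetical helper for illustration: all minority instances are kept,
    and majority instances are dropped uniformly at random.
    """
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must be strictly between 0 and 1")

    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]

    # Number of majority instances that yields the requested ratio.
    target_majority = round(
        len(minority_idx) * (1 - minority_fraction) / minority_fraction
    )
    # Never ask for more majority instances than actually exist.
    target_majority = min(target_majority, len(majority_idx))

    rng = random.Random(seed)
    kept_majority = rng.sample(majority_idx, target_majority)
    return sorted(minority_idx + kept_majority)


# Toy data (assumed, not from the paper): 20 minority and 180 majority labels.
labels = ["pos"] * 20 + ["neg"] * 180

keep_35 = random_undersample(labels, "pos", 0.35)  # 35:65 final ratio
keep_50 = random_undersample(labels, "pos", 0.50)  # 50:50 final ratio

print(Counter(labels[i] for i in keep_35))
print(Counter(labels[i] for i in keep_50))
```

On this toy input, the 35:65 setting retains 37 of the 180 majority instances, while 50:50 retains only 20, which illustrates the abstract's point that the less extreme ratio discards less data. Which majority instances survive depends on the seed.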
