New Multi-Label Correlation-Based Feature Selection Methods for Multi-Label Classification and Application in Bioinformatics

Abstract

The very large dimensionality of real-world datasets is a challenging problem for classification algorithms, since many features are often redundant or irrelevant for classification. In addition, a very large number of features leads to a high computational time for classification algorithms. Feature selection methods are used to deal with the large dimensionality of data by selecting a relevant feature subset according to an evaluation criterion. The vast majority of research on feature selection involves conventional single-label classification problems, where each instance is assigned a single class label; but there has been growing research on more complex multi-label classification problems, where each instance can be assigned multiple class labels.

This thesis proposes three types of new Multi-Label Correlation-based Feature Selection (ML-CFS) methods, namely: (a) methods based on hill-climbing search, (b) methods that exploit biological knowledge (still using hill-climbing search), and (c) methods based on genetic algorithms as the search method.

Firstly, we proposed three versions of ML-CFS methods based on hill-climbing search. In essence, these ML-CFS versions extend the original CFS method by extending the merit function (which evaluates candidate feature subsets) to the multi-label classification scenario, as well as modifying the merit function in other ways. A conventional search strategy, hill-climbing, was used to explore the space of candidate solutions (candidate feature subsets) for those three versions of ML-CFS. These ML-CFS versions are described in detail in Chapter 4.

Secondly, in order to try to improve the performance of ML-CFS on cancer-related microarray gene expression datasets, we proposed three versions of the ML-CFS method that exploit biological knowledge. These ML-CFS versions are also based on hill-climbing search, but the merit function was modified in a way that favours the selection of genes (features) involved in pre-defined cancer-related pathways, as discussed in detail in Chapter 5.

Lastly, we proposed two more sophisticated versions of ML-CFS based on Genetic Algorithms (rather than hill-climbing) as the search method. The first version of GA-based ML-CFS is based on a conventional single-objective GA, where there is only one objective to be optimized; while the second version of GA-based ML-CFS performs lexicographic multi-objective optimization, where there are two objectives to be optimized, as discussed in detail in Chapter 6.

In this thesis, all proposed ML-CFS methods for multi-label classification problems were evaluated by measuring the predictive accuracies obtained by two well-known multi-label classification algorithms when using the selected features, namely: the Multi-Label k-Nearest Neighbours (ML-kNN) algorithm and the Back-Propagation Multi-Label Learning (BPMLL) neural network algorithm.

In general, the results obtained by the best version of the proposed ML-CFS methods, namely a GA-based ML-CFS method, were competitive with the results of other multi-label feature selection methods and baseline approaches. More precisely, one of our GA-based methods achieved the second best predictive accuracy out of all methods being compared (both with ML-kNN and BPMLL used as classifiers), but there was no statistically significant difference between that GA-based ML-CFS and the best method in terms of predictive accuracy. In addition, in the experiments with ML-kNN, the most accurate method selects about twice as many features as our GA-based ML-CFS; whilst in the experiments with BPMLL, the most accurate method was a baseline method that does not perform any feature selection and runs the classifier once (with all original features) for each of the many class labels, which is a very computationally expensive baseline approach.

In summary, one of the proposed GA-based ML-CFS methods managed to achieve substantial data reduction (selecting a smaller subset of relevant features) without a significant decrease in predictive accuracy with respect to the most accurate method.
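
As a rough illustration of the hill-climbing idea described in the abstract, the sketch below pairs the standard CFS merit function, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), with a greedy forward search. It is not the thesis's implementation: averaging the feature-label correlation over all labels, the use of absolute Pearson correlation, and the names `abs_corr`, `merit` and `hill_climb` are assumptions made for illustration only.

```python
import numpy as np

def abs_corr(a, b):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    sa, sb = a.std(), b.std()
    if sa == 0 or sb == 0:
        return 0.0
    return abs(float(np.mean((a - a.mean()) * (b - b.mean())) / (sa * sb)))

def merit(subset, X, Y):
    """CFS-style merit of a feature subset for a label matrix Y.

    Assumption: the feature-label term is the mean correlation over
    every (feature, label) pair, one simple multi-label extension.
    """
    k = len(subset)
    if k == 0:
        return 0.0
    r_fl = np.mean([abs_corr(X[:, f], Y[:, l])
                    for f in subset for l in range(Y.shape[1])])
    if k == 1:
        r_ff = 0.0
    else:
        r_ff = np.mean([abs_corr(X[:, f], X[:, g])
                        for i, f in enumerate(subset) for g in subset[i + 1:]])
    return float(k * r_fl / np.sqrt(k + k * (k - 1) * r_ff))

def hill_climb(X, Y):
    """Greedy forward search: add the feature that most improves the merit."""
    selected, best = [], 0.0
    remaining = set(range(X.shape[1]))
    while remaining:
        cand, cand_merit = None, best
        for f in sorted(remaining):
            m = merit(selected + [f], X, Y)
            if m > cand_merit:  # strict improvement only
                cand, cand_merit = f, m
        if cand is None:        # no feature improves the merit: local optimum
            break
        selected.append(cand)
        remaining.remove(cand)
        best = cand_merit
    return selected, best

# Toy data: two uncorrelated labels, one feature per label,
# a duplicate of feature 0, and an irrelevant feature.
y0 = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y1 = np.array([0, 0, 1, 1, 0, 0, 1, 1])
noise = np.array([0, 1, 0, 1, 0, 1, 0, 1])
Y = np.column_stack([y0, y1])
X = np.column_stack([y0, y1, y0, noise]).astype(float)

selected, best = hill_climb(X, Y)
print(selected, round(best, 4))
```

On this toy data the search keeps one feature per label and leaves out both the duplicate column (penalised through the feature-feature term) and the irrelevant column (penalised through the feature-label term).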
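
The lexicographic multi-objective optimization mentioned in the abstract can be sketched as a priority-ordered comparison between two candidates. The abstract does not say which two objectives the thesis uses or how ties are handled, so the objectives below (a merit score to maximise, then the negated subset size), the tolerance values, and the function name `lexicographic_better` are hypothetical and only illustrate the general technique.

```python
def lexicographic_better(a, b, tolerances):
    """Return True if candidate a beats candidate b.

    a, b: tuples of objective values in priority order, all to be maximised.
    tolerances: per-objective thresholds; a difference no larger than the
    threshold counts as a tie, and the next objective decides.
    Assumption: if every objective ties, neither candidate wins.
    """
    for va, vb, tol in zip(a, b, tolerances):
        if abs(va - vb) > tol:
            return va > vb
    return False

# Hypothetical objectives: (merit, -number_of_selected_features)
tol = (0.02, 0)
print(lexicographic_better((0.70, -10), (0.69, -20), tol))  # merit ties, smaller subset wins
print(lexicographic_better((0.70, -10), (0.80, -30), tol))  # merit differs clearly, it decides
```

The tolerance on the first objective is what lets the lower-priority objective matter at all; with a tolerance of zero the comparison would almost always be settled by the first objective alone.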

Bibliographic record

  • Author

    Jungjit Suwimol

  • Author affiliation
  • Year: 2016
  • Total pages
  • Original format: PDF
  • Language: en
  • CLC classification
