High-dimensional data in data mining often demands large storage capacity, lengthens knowledge-mining time, and lowers prediction accuracy. Feature selection is one of the main ways to address these problems. Existing feature selection algorithms have difficulty determining the optimal number of features, and their classification accuracy leaves room for improvement. To address these shortcomings, this paper proposes mCRC, a multi-criterion weighted ranking algorithm that considers both relevance and redundancy. mCRC measures the dependency between features and class labels using mutual information and class distances (class separability), removes irrelevant and completely redundant attributes, and then ranks the remaining features using both criteria together. To obtain the optimal feature subset, it applies Sequential Forward Floating Selection (SFFS) with a C-SVM classifier to the features sorted in descending order of importance. Experimental results show that, compared with feature selection methods that rank by mutual information alone or by class separability alone, mCRC obtains an optimal feature subset with better classification performance in less time, which supports fast and efficient mining of data sets.
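The pipeline described in the abstract (rank by two criteria, then search the ranked list) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the formulas, so the Fisher-style ratio used for class separability, the min-max normalisation, the equal default `weight`, and the simplified floating search are all assumptions, and `evaluate` is a hypothetical callback standing in for the C-SVM accuracy of a candidate subset. Features are assumed to be discrete numeric columns.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def class_separability(xs, ys):
    """Assumed criterion: between-class over within-class scatter."""
    mean = sum(xs) / len(xs)
    between = within = 0.0
    for label in set(ys):
        vals = [x for x, y in zip(xs, ys) if y == label]
        mu = sum(vals) / len(vals)
        between += len(vals) * (mu - mean) ** 2
        within += sum((v - mu) ** 2 for v in vals)
    return between / (within + 1e-12)  # small constant avoids division by zero


def normalise(scores):
    """Min-max scale scores to [0, 1] so the two criteria are comparable."""
    lo, hi = min(scores), max(scores)
    return [0.0 if hi == lo else (s - lo) / (hi - lo) for s in scores]


def rank_features(columns, labels, weight=0.5):
    """Feature indices in descending order of the weighted combined score."""
    mi = normalise([mutual_information(col, labels) for col in columns])
    cs = normalise([class_separability(col, labels) for col in columns])
    combined = [weight * m + (1 - weight) * c for m, c in zip(mi, cs)]
    return sorted(range(len(columns)), key=lambda i: combined[i], reverse=True)


def floating_forward_search(ranked, evaluate):
    """Simplified SFFS: add features in ranked order, then try removals."""
    selected, best = [], float("-inf")
    for f in ranked:
        score = evaluate(selected + [f])
        if score <= best:
            continue  # adding f does not help, so skip it
        selected, best = selected + [f], score
        improved = True
        while improved and len(selected) > 1:  # floating (backward) step
            improved = False
            for g in selected[:-1]:
                trial = [h for h in selected if h != g]
                trial_score = evaluate(trial)
                if trial_score > best:
                    selected, best, improved = trial, trial_score, True
                    break
    return selected, best
```

In practice `evaluate` would train a classifier on the candidate columns and return a validation score. Passing the ranked list to the search is what lets the most promising features be tried first.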