Imbalanced data are common in real-world applications, and classifying them is an active research topic in machine learning. Most traditional classification algorithms assume a balanced class distribution or equal misclassification costs, so they perform poorly on imbalanced data. This paper proposes PCBoost, a classification algorithm for imbalanced data. The algorithm builds decision trees with the information gain ratio as the splitting criterion and uses each tree as a weak classifier. At the start of each iteration, a data synthesis method adds synthetic minority-class examples to balance the training information. After the sub-classifier is formed, the algorithm corrects this "perturbation" by deleting the synthetic examples that were not classified correctly. The paper also discusses the data synthesis method, gives a theoretical analysis of the training error bound, and analyzes the choice of ensemble learning parameters. Experimental results show that PCBoost has advantages on imbalanced data classification problems.
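The per-iteration procedure described in the abstract (synthesize minority examples, train a weak classifier, delete the misclassified synthetic examples) can be illustrated with a rough sketch. This is not the authors' implementation, and the abstract does not give the details it fills in. The following are assumptions made only for illustration: a one-level decision stump stands in for the gain-ratio decision tree, synthetic examples come from linear interpolation between random pairs of minority examples, new examples receive the mean current weight, and the weight update is the standard AdaBoost rule. The names `pcboost_sketch`, `synthesize`, `train_stump`, and `predict` are hypothetical.

```python
import math
import random


def normalize(w):
    total = sum(w)
    return [wi / total for wi in w]


def stump_predict(stump, x):
    f, t, sign = stump
    return sign if x[f] > t else -sign


def train_stump(X, y, w):
    """Weighted decision stump (assumed stand-in for the gain-ratio tree)."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for sign in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict((f, t, sign), x) != yi)
                if best is None or err < best[0]:
                    best = (err, f, t, sign)
    return best[1:]


def synthesize(minority, n, rng):
    """Assumed synthesis rule: interpolate between random minority pairs."""
    out = []
    for _ in range(n):
        a, b = rng.choice(minority), rng.choice(minority)
        lam = rng.random()
        out.append([ai + lam * (bi - ai) for ai, bi in zip(a, b)])
    return out


def pcboost_sketch(X, y, rounds=10, seed=0):
    """Labels: +1 is the minority class, -1 the majority class."""
    rng = random.Random(seed)
    X, y = [list(x) for x in X], list(y)
    w = [1.0 / len(X)] * len(X)
    synthetic = [False] * len(X)
    ensemble = []
    for _ in range(rounds):
        # Step 1: add synthetic minority examples until the classes balance.
        minority = [x for x, yi in zip(X, y) if yi == 1]
        need = (len(y) - len(minority)) - len(minority)
        if need > 0 and minority:
            new = synthesize(minority, need, rng)
            mean_w = sum(w) / len(w)  # assumed initial weight for new examples
            X += new
            y += [1] * need
            synthetic += [True] * need
            w = normalize(w + [mean_w] * need)
        # Step 2: train the weak classifier on the balanced, weighted set.
        stump = train_stump(X, y, w)
        preds = [stump_predict(stump, x) for x in X]
        # Step 3: delete synthetic examples the weak classifier got wrong.
        keep = [i for i in range(len(X))
                if not (synthetic[i] and preds[i] != y[i])]
        X = [X[i] for i in keep]
        y = [y[i] for i in keep]
        preds = [preds[i] for i in keep]
        synthetic = [synthetic[i] for i in keep]
        w = normalize([w[i] for i in keep])
        # Step 4: standard AdaBoost weighting (assumed, not from the abstract).
        err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = normalize([wi * math.exp(-alpha * yi * p)
                       for wi, yi, p in zip(w, y, preds)])
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score > 0 else -1
```

Calling `pcboost_sketch(X, y, rounds=5)` on a list of feature vectors `X` with labels `y` in {-1, +1} returns a list of `(alpha, stump)` pairs, which `predict` combines by weighted vote. Because misclassified synthetic examples are deleted each round, the next round tops the minority class up again, so the synthetic portion of the training set changes across iterations while the real examples are always retained.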