Similarity-based attribute weighting methods via clustering algorithms in the classification of imbalanced medical datasets

Polat Kemal

Abstract

In pattern recognition and machine learning, data preprocessing algorithms have been used increasingly in recent years to achieve high classification performance. Preprocessing before classification has become practically unavoidable for medical datasets with nonlinear and imbalanced data distributions. This study proposes a new data preprocessing method for classifying five such datasets: Parkinson, hepatitis, Pima Indians, single photon emission computed tomography (SPECT) heart, and thoracic surgery. All five were taken from the UCI Machine Learning Repository. The proposed method consists of three steps. In the first step, the cluster centers of each attribute are calculated using the k-means, fuzzy c-means, and mean shift clustering algorithms. In the second step, the absolute differences between the data in each attribute and the cluster centers are calculated and then averaged for each attribute. In the final step, the weighting coefficients are calculated by dividing the mean difference by the cluster centers, and the attribute values in the dataset are then multiplied by the resulting coefficients. This yields three attribute weighting methods: (1) similarity-based attribute weighting in k-means clustering, (2) similarity-based attribute weighting in fuzzy c-means clustering, and (3) similarity-based attribute weighting in mean shift clustering. The aim is to bring the data in each class closer together through the proposed attribute weighting and so reduce the within-class variance.
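
The three weighting steps can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say how many clusters are used per attribute, how a value is matched to a cluster center, or how several centers enter the final ratio. The sketch assumes two clusters per attribute, takes the difference to the nearest center, and divides by the mean of the centers. Only the k-means variant is shown, and the helper names `kmeans_1d`, `attribute_weight`, and `weight_dataset` are hypothetical.

```python
from statistics import mean


def kmeans_1d(values, k=2, iters=50):
    """Plain Lloyd's algorithm on a single attribute (a list of numbers)."""
    ordered = sorted(values)
    n = len(ordered)
    # Deterministic start: k points spread evenly through the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous center.
        updated = [mean(g) if g else centers[j] for j, g in enumerate(groups)]
        if updated == centers:
            break
        centers = updated
    return centers


def attribute_weight(values, k=2):
    # Step 1: cluster centers of this attribute.
    centers = kmeans_1d(values, k)
    # Step 2: mean absolute difference between each value and a center.
    # ASSUMPTION: each value is compared with its nearest center.
    mean_diff = mean(min(abs(v - c) for c in centers) for v in values)
    # Step 3: ratio of the mean difference to the cluster centers.
    # ASSUMPTION: several centers are summarised by their mean.
    center_level = mean(centers)
    if center_level == 0:
        return 1.0  # ASSUMPTION: leave the attribute unscaled if the ratio is undefined
    return mean_diff / center_level


def weight_dataset(rows, k=2):
    """Return (weights, weighted_rows) for a list of equal-length numeric rows."""
    columns = [list(col) for col in zip(*rows)]
    weights = [attribute_weight(col, k) for col in columns]
    weighted = [[w * v for w, v in zip(weights, row)] for row in rows]
    return weights, weighted


if __name__ == "__main__":
    # Toy data, invented for illustration only (two attributes, five samples).
    toy = [[1.0, 12.0], [2.0, 15.0], [8.0, 31.0], [9.0, 28.0], [10.0, 33.0]]
    weights, weighted = weight_dataset(toy)
    print(weights)
    print(weighted[0])
```

Swapping `kmeans_1d` for a fuzzy c-means or mean shift routine that returns per-attribute centers would give the other two variants under the same assumptions.
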
Reducing the variance within each class gathers the data of that class together and, at the same time, increases the discrimination between the classes. For comparison with other methods in the literature, random subsampling was also used to handle the imbalanced datasets. After attribute weighting, four classification algorithms (linear discriminant analysis, k-nearest neighbor, support vector machine, and random forest) were used to classify the imbalanced medical datasets. The proposed models were evaluated by classification accuracy, precision, recall, area under the ROC curve, kappa value, and F-measure. Three schemes were used to train and test the classifiers: a 50-50% train-test holdout, a 60-40% train-test holdout, and tenfold cross-validation. The experimental results show that the proposed attribute weighting methods achieved higher classification performance than random subsampling on the imbalanced medical datasets.
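
Most of the evaluation measures listed above follow directly from a binary confusion matrix. The sketch below uses the standard textbook definitions, not code from the paper. It assumes the minority class is treated as the positive class, and the function name `binary_metrics` is hypothetical. Area under the ROC curve is left out because it needs ranked classifier scores rather than counts.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, F-measure and Cohen's kappa from counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    # Cohen's kappa: observed agreement corrected for chance agreement p_e.
    p_yes = ((tp + fp) / total) * ((tp + fn) / total)
    p_no = ((fn + tn) / total) * ((fp + tn) / total)
    p_e = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e) if p_e < 1 else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "kappa": kappa,
    }


if __name__ == "__main__":
    # Invented counts: 12 positive and 88 negative samples.
    print(binary_metrics(tp=8, fp=2, fn=4, tn=86))
    # A classifier that always predicts the majority class.
    print(binary_metrics(tp=0, fp=0, fn=12, tn=88))
```

The second call shows why accuracy alone is a weak measure on imbalanced data: always predicting the majority class still scores 88% accuracy on these invented counts, while recall, F-measure, and kappa all fall to zero.
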

Bibliographic record

Source: Neural Computing & Applications, 2018, Issue 3, 27 pages
Author: Polat Kemal
Affiliation: Abant Izzet Baysal Univ, Fac Engn & Architecture, Dept Elect & Elect Engn, TR-14280 Bolu, Turkey
Original format: PDF
Language: English
Chinese Library Classification: artificial neural network computing; artificial intelligence theory
Keywords: Imbalanced medical dataset classification; Data preprocessing; Attribute weighting; Clustering algorithms
