Instance reduction for supervised learning using input-output clustering method

YODJAIPHET Anusorn; THEERA-UMPON Nipon; AUEPHANWIRIYAKUL Sansanee

Abstract

A method is proposed that uses input-output clustering to reduce the number of samples in large data sets. The method first clusters the output data into groups and then clusters the input data in accordance with those output groups. A set of prototypes is then selected from the clustered input data, and the inessential data are discarded from the data set. Because only the prototypes are used, the method can also reduce the effect of outliers. The method is applied to instance reduction in regression problems and evaluated on two standard synthetic data sets and three standard real-world data sets. Support vector regression models are trained on the original data sets and on the corresponding instance-reduced data sets, and their root-mean-square errors are compared. In the experiments, the method gives good results for both the reduction and the reconstruction of the synthetic and real-world data sets. The numbers of instances in the synthetic data sets are reduced by 25% to 69%. The reduction rates for the real-world data sets of the automobile miles per gallon and the 1990 census in CA are 46% and 57%, respectively. The electrocardiogram (ECG) data set reaches a reduction rate of 96%, owing to the redundant and periodic nature of ECG signals. For all of the data sets, the regression results obtained from the reduced data are similar to those from the corresponding original data sets. The proposed method therefore preserves regression performance while requiring only a fraction of the data for training.
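The three steps named in the abstract (cluster the outputs, cluster the inputs within each output group, select prototypes) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not say which clustering algorithm is used, how many clusters are chosen, or how a prototype is picked. The sketch assumes plain k-means for both stages, fixed cluster counts, and the instance nearest each cluster mean as the prototype. The names `kmeans_labels`, `reduce_instances`, `n_output_groups`, and `n_input_clusters` are hypothetical.

```python
import math
import random

def sqdist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans_labels(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means (an assumed choice); returns one label per point."""
    rng = random.Random(seed)
    k = min(k, len(points))
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(iters):
        # Assign every point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sqdist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members)
                                     for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels

def reduce_instances(X, y, n_output_groups=3, n_input_clusters=4, seed=0):
    """Return sorted indices of the instances kept as prototypes."""
    X = [tuple(row) for row in X]
    # Step 1: cluster the output values into groups.
    out_labels = kmeans_labels([(v,) for v in y], n_output_groups, seed=seed)
    kept = []
    for g in sorted(set(out_labels)):
        idx = [i for i, lab in enumerate(out_labels) if lab == g]
        # Step 2: cluster the inputs that fall in this output group.
        in_labels = kmeans_labels([X[i] for i in idx], n_input_clusters,
                                  seed=seed)
        for c in sorted(set(in_labels)):
            members = [i for i, lab in zip(idx, in_labels) if lab == c]
            mean = tuple(sum(col) / len(members)
                         for col in zip(*(X[i] for i in members)))
            # Step 3 (assumed rule): keep the real instance nearest the mean.
            kept.append(min(members, key=lambda i: sqdist(X[i], mean)))
    return sorted(kept)

# Toy data for illustration only; not one of the paper's data sets.
xs = [i / 20 for i in range(200)]
X = [(x,) for x in xs]
y = [math.sin(x) for x in xs]
kept = reduce_instances(X, y)
reduction_rate = 1 - len(kept) / len(X)
print(len(X), "->", len(kept), "instances kept")
```

Keeping real instances rather than cluster means leaves each retained input paired with its own observed output, so the reduced set can be passed directly to a regressor such as a support vector regression model. With fixed cluster counts the reduction rate is set mostly by those counts, so the rates reported in the abstract should not be expected from this sketch.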

