In the KNN algorithm, the parameter K is usually chosen by repeated cross-validation, but this is impractical when the dataset is large. Meanwhile, the most fundamental factor affecting parameter selection is the dataset itself. We therefore propose a method that predicts the optimal K value from features of the dataset itself. First, a feature vector is constructed for each historical dataset by extracting simple features, statistical features, information entropy features, accuracy features of simple algorithms, complexity features, and so on. Then linear regression, neural networks, and similar methods are used to build a prediction model between the feature vector and the optimal K value, and the model is used to predict the optimal K for a new dataset. Experiments on UCI datasets show that the method predicts the optimal K value quickly while maintaining a certain level of accuracy.
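The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`meta_features`, `fit_linear`, `predict_k`), the particular five features, and the tiny ridge term are all assumptions made for illustration. Only a few simple, statistical, and entropy features are computed; the accuracy-of-simple-algorithm and complexity features, and the neural network model, are omitted.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def meta_features(X, y):
    """Describe a dataset (rows X, labels y) by a small feature vector.

    The choice of features here is hypothetical, not taken from the paper.
    """
    n, d = len(X), len(X[0])
    counts = Counter(y)
    # Shannon entropy of the class distribution, in bits.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    # Mean per-column standard deviation as a crude statistical feature.
    spread = mean([pstdev([row[j] for row in X]) for j in range(d)])
    return [math.log(n), float(d), float(len(counts)), entropy, spread]


def solve(A, b):
    """Solve A w = b by Gauss-Jordan elimination with partial pivoting."""
    m = len(A)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * p for a, p in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]


def fit_linear(F, k, ridge=1e-6):
    """Fit k ~ w[0] + w[1:] . f by least squares (normal equations).

    F holds one meta-feature vector per historical dataset and k the
    optimal K found for each. The small ridge term (an assumption) keeps
    the system solvable when features are collinear.
    """
    Z = [[1.0] + list(f) for f in F]
    p = len(Z[0])
    A = [[sum(z[i] * z[j] for z in Z) + (ridge if i == j else 0.0)
          for j in range(p)] for i in range(p)]
    b = [sum(z[i] * t for z, t in zip(Z, k)) for i in range(p)]
    return solve(A, b)


def predict_k(w, f):
    """Predict K for a new dataset, rounded to an integer of at least 1."""
    raw = w[0] + sum(wi * fi for wi, fi in zip(w[1:], f))
    return max(1, round(raw))
```

In use, each historical dataset would contribute one row `meta_features(X, y)` to `F` and its cross-validated optimal K to `k`; a new dataset then needs only one feature extraction and one call to `predict_k`, which is where the speed advantage over repeated cross-validation would come from.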