Nearest-neighbor classification with categorical variables

Samuel E. Buttrey

Abstract

A technique is presented for adapting nearest-neighbor classification to the case of categorical variables. The set of categories is mapped onto the real line in such a way as to maximize the ratio of total sum of squares to within-class sum of squares, aggregated over classes. The resulting real values then replace the categories, and nearest-neighbor classification proceeds with the Euclidean metric on these new values. Continuous variables can be included in this scheme with little added effort. This approach has been implemented in a computer program and tried on a number of data sets, with encouraging results.

Nearest-neighbor classification is a well-known and effective classification technique. With this scheme, an unknown item's distances to all known items are measured, and the unknown class is estimated by the class of the nearest neighbor or by the class most often represented among a set of nearest neighbors. This has proven effective in many examples, but an appropriate distance normalization is required when variables are scaled differently. For categorical variables, "distance" is not even defined. In this paper, categorical data values are replaced by real numbers in an optimal way; those real numbers are then used in nearest-neighbor classification.
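The scoring idea in the abstract can be illustrated with a small sketch. This is not the author's program: it handles a single categorical variable only, and the function names (`category_scores`, `knn_predict`), the `ridge` term, and the toy data are all assumptions made for illustration. Each category is coded as an indicator column, the score vector that maximizes total sum of squares over within-class sum of squares is found as the leading generalized eigenvector, and a k-nearest-neighbor vote is then taken with Euclidean distance on the scores.

```python
import numpy as np


def category_scores(codes, labels, n_categories, ridge=1e-6):
    """Score the categories of ONE variable (hypothetical helper).

    Finds s maximizing (s' T s) / (s' W s), where T is the total scatter and
    W the pooled within-class scatter of the category indicator columns.
    """
    X = np.eye(n_categories)[codes]            # n x C indicator matrix
    Xc = X - X.mean(axis=0)                    # center on the grand mean
    T = Xc.T @ Xc                              # total scatter
    W = np.zeros_like(T)
    for g in np.unique(labels):
        Xg = X[labels == g]
        Xg = Xg - Xg.mean(axis=0)              # center on the class mean
        W += Xg.T @ Xg                         # pooled within-class scatter
    # Assumption: a small ridge keeps W positive definite, because the
    # indicator columns sum to one and W is otherwise singular.
    W += ridge * np.eye(n_categories)
    L = np.linalg.cholesky(W)                  # W = L L'
    Linv = np.linalg.inv(L)
    M = Linv @ T @ Linv.T                      # symmetric standard eigenproblem
    _, vecs = np.linalg.eigh((M + M.T) / 2)    # eigenvalues in ascending order
    return Linv.T @ vecs[:, -1]                # map the top eigenvector back to s


def knn_predict(train_x, train_y, query_x, k=3):
    """Majority vote among the k nearest training items (Euclidean metric)."""
    preds = []
    for q in query_x:
        dist = np.sqrt(((train_x - q) ** 2).sum(axis=1))
        nearest = np.argsort(dist, kind="stable")[:k]
        preds.append(np.bincount(train_y[nearest]).argmax())
    return np.array(preds)


# Toy data (made up): categories 0 and 1 occur mostly in class 0,
# categories 2 and 3 mostly in class 1.
codes = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 0])
labels = np.array([0] * 9 + [1] * 9)

scores = category_scores(codes, labels, n_categories=4)
train_x = scores[codes][:, None]               # replace categories by scores
query_codes = np.array([0, 1, 2, 3])
pred = knn_predict(train_x, labels, scores[query_codes][:, None], k=3)
print(scores, pred)
```

The sign and scale of the scores are arbitrary, since the criterion is a ratio; only the relative spacing of the categories affects the neighbor search. Extending the sketch to several variables, or to mixed continuous and categorical data, would require the aggregation described in the paper, which is not reproduced here.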


Bibliographic record

Source: Computational Statistics & Data Analysis, 1998, Issue 2, 13 pages
Author: Samuel E. Buttrey
Original format: PDF
Language: English
Chinese Library Classification: Computing technology, computer technology
Keywords: optimal scaling; cross-validation; Fisher's criterion; choice of metric


Similar literature

1. Nearest-neighbor classification with categorical variables [J]. Samuel E. Buttrey. Computational Statistics & Data Analysis. 1998, Issue 2

2. Statistics for Categorical Surveys—A New Strategy for Multivariate Classification and Determining Variable Importance [J]. Alexander Herr. Sustainability. 2010, Issue 2

3. Regularized classification for mixed continuous and categorical variables under across-location heteroscedasticity [J]. Leung CY. Journal of Multivariate Analysis: An International Journal. 2005, Issue 2

4. ZigZag, a New Clustering Algorithm to Analyze Categorical Variable Cross-Classification Tables [C]. Stephane Lallich. European Conference on Principles of Data Mining and Knowledge Discovery. 1999

5. EFFECTS OF CATEGORICAL STRUCTURE ON SPONTANEOUS ACCESS AND UTILIZATION OF INFORMATION IN A CLASSIFICATION TASK [D]. YEARWOOD, AMY ALISON. 1987

6. Exploring dependence between categorical variables: Benefits and limitations of using variable selection within Bayesian clustering in relation to log-linear modelling with interaction terms [O]. Michail Papathomas, Sylvia Richardson

7. Statistics for Categorical Surveys—A New Strategy for Multivariate Classification and Determining Variable Importance [O]. Alexander Herr. 2010
