Proceedings of the Royal Society. Mathematical, physical and engineering sciences

Statistical approach to normalization of feature vectors and clustering of mixed datasets


Abstract

Normalization of feature vectors is widely used in data mining, in particular in cluster analysis, where it prevents features with large numerical values from dominating distance-based objective functions. This study proposes a unified statistical approach to normalizing all attributes of mixed databases in which different metrics are used for numerical and categorical data. After the proposed normalization, numerical and categorical attributes contribute statistically equally to a specified objective function. Explicit formulae are given for the statistically normalized Minkowski mixed p-metrics. The classic z-score standardization and the min-max normalization are shown to be particular cases of the statistical normalization, when the objective function is based on the Euclidean or the Tchebycheff (Chebyshev) metric, respectively. Finally, several benchmark datasets are clustered with the non-normalized and the newly introduced normalized mixed metrics, using either the k-prototypes algorithm (for p = 2) or another algorithm (for p ≠ 2).
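
As a rough illustration of the two classic normalizations named in the abstract, and of why an un-normalized mixed metric can be dominated by a large-valued numerical attribute, here is a minimal Python sketch. It does not reproduce the paper's statistically normalized formulae, which the abstract does not state. The function names, the use of the population standard deviation, and the 0/1 mismatch measure for categorical attributes are all assumptions made for this example.

```python
from statistics import mean, pstdev


def z_score(values):
    """Classic z-score standardization: (x - mean) / standard deviation.

    Assumption: the population standard deviation is used.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant attribute carries no spread
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Classic min-max normalization onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def mixed_minkowski(x_num, y_num, x_cat, y_cat, p=2.0):
    """Illustrative (hypothetical) mixed p-metric.

    Numerical attributes contribute |a - b|**p; each categorical attribute
    contributes 1 on a mismatch and 0 on a match. No attribute weighting
    or statistical normalization is applied here.
    """
    num_part = sum(abs(a - b) ** p for a, b in zip(x_num, y_num))
    cat_part = sum(1 for a, b in zip(x_cat, y_cat) if a != b)
    return (num_part + cat_part) ** (1.0 / p)


if __name__ == "__main__":
    incomes = [20000.0, 40000.0, 60000.0]  # made-up numerical attribute
    colours = ["red", "blue", "red"]       # made-up categorical attribute

    # Raw values: the income difference swamps the categorical mismatch.
    raw = mixed_minkowski([incomes[0]], [incomes[1]], [colours[0]], [colours[1]])

    # After min-max scaling both parts are on a comparable scale.
    scaled = min_max(incomes)
    norm = mixed_minkowski([scaled[0]], [scaled[1]], [colours[0]], [colours[1]])

    print("raw distance:", raw)
    print("min-max scaled distance:", norm)
```

With the raw incomes, the single categorical mismatch has almost no effect on the distance. After scaling, the numerical and categorical parts are of similar magnitude, which is the imbalance the proposed normalization is meant to address in a statistically grounded way.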
