
Large-Scale Instance Selection using Center of Principal Components



Abstract

One-hot encoding is one of the most popular data-preprocessing methods for converting categorical data into numerical data before training a classification model. Its drawbacks are high memory consumption and large storage requirements when the training set has many categorical columns. To overcome these drawbacks, we propose the PC-CS method, which reduces the training-set size by selecting only the center of the principal components as the representative instance of each disjoint partition. The proposed method was compared with two instance selection methods and with the full training model as a baseline. We used five datasets from the UCI dataset repository. Classification performance was evaluated with four classifier algorithms: decision tree, naive Bayes, support vector machine, and logistic regression. The average classification performance of the proposed PC-CS method was approximately 2% higher than those of the other methods tested, while its reduction rate was about 5% higher.
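The core idea described in the abstract, selecting, for each disjoint partition of the training set, a single representative instance near the center in principal-component space, can be sketched as follows. This is only an illustration of that idea, not the paper's implementation: the partitioning rule (here, by class label) and the selection rule (the instance nearest the partition mean in PC space) are assumptions, and `pc_center_select` is a hypothetical helper name.

```python
import numpy as np

def pc_center_select(X, y, n_components=2):
    """Pick one representative instance per class partition: the instance
    closest to the partition center in principal-component space.

    Sketch only: the paper's actual PC-CS partitioning and selection
    rules may differ from the assumptions made here.
    """
    reps = []
    for label in np.unique(y):
        part = X[y == label]              # one disjoint partition (assumed: by class)
        centered = part - part.mean(axis=0)
        # Principal axes of the partition via SVD of the centered data
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        proj = centered @ vt[:n_components].T
        # The partition mean projects to the origin in PC space, so the
        # "center" representative is the instance with the smallest norm there
        idx = np.argmin(np.linalg.norm(proj, axis=1))
        reps.append(part[idx])
    return np.array(reps)
```

Under these assumptions, a training set of n instances shrinks to one instance per partition, which is where the reduction rate reported in the abstract would come from.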


