Journal: Machine Learning

Similarity encoding for learning with dirty categorical variables

Abstract

For statistical learning, categorical variables in a table are usually considered as discrete entities and encoded separately to feature vectors, e.g., with one-hot encoding. “Dirty” non-curated data give rise to categorical variables with a very high cardinality but redundancy: several categories reflect the same entity. In databases, this issue is typically solved with a deduplication step. We show that a simple approach that exposes the redundancy to the learning algorithm brings significant gains. We study a generalization of one-hot encoding, similarity encoding, that builds feature vectors from similarities across categories. We perform a thorough empirical validation on non-curated tables, a problem seldom studied in machine learning. Results on seven real-world datasets show that similarity encoding brings significant gains in predictive performance in comparison with known encoding methods for categories or strings, notably one-hot encoding and bag of character n-grams. We draw practical recommendations for encoding dirty categories: 3-gram similarity appears to be a good choice to capture morphological resemblance. For very high cardinalities, dimensionality reduction significantly reduces the computational cost with little loss in performance: random projections or choosing a subset of prototype categories still outperform classic encoding approaches.
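To make the idea concrete, here is a minimal sketch of similarity encoding in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the similarity measure, so the sketch assumes a Jaccard similarity between sets of character 3-grams, pads each string with one space on each side, and lowercases it. The function names (`char_ngrams`, `ngram_similarity`, `similarity_encode`) and the example categories are hypothetical. Passing a shorter `prototypes` list is one simple way to picture the "subset of prototype categories" mentioned above.

```python
from typing import List, Sequence, Set


def char_ngrams(text: str, n: int = 3) -> Set[str]:
    """Return the set of character n-grams of a string.

    Assumption: the string is lowercased and padded with one space on each side.
    """
    padded = f" {text.lower()} "
    if len(padded) < n:
        return {padded}
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the n-gram sets of two strings (assumed measure)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def similarity_encode(
    values: Sequence[str], prototypes: Sequence[str], n: int = 3
) -> List[List[float]]:
    """Encode each value as its vector of similarities to the prototype categories.

    With a similarity that is 1 for identical strings and 0 otherwise,
    this reduces to one-hot encoding over the prototypes.
    """
    return [[ngram_similarity(v, p, n) for p in prototypes] for v in values]


# Hypothetical dirty category values: the second is a misspelling of the first.
prototypes = ["police officer", "firefighter"]
values = ["police officer", "police oficer", "firefighter"]
encoded = similarity_encode(values, prototypes)

for value, row in zip(values, encoded):
    print(value, [round(x, 2) for x in row])
```

In this sketch the misspelled entry gets a vector close to that of its correctly spelled counterpart, whereas one-hot encoding would treat it as an unrelated category.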