Computer Science and Information Technology

Feature Selection in Sparse Matrices


Abstract

Feature selection, as a pre-processing step to machine learning, is effective in reducing dimensionality, removing irrelevant data, increasing learning accuracy, and improving result comprehensibility. There are two main approaches to feature selection: wrapper methods, in which features are selected using the supervised learning algorithm itself, and filter methods, in which the selection of features is independent of any learning algorithm. However, most of these techniques rely on feature-scoring algorithms that make basic assumptions about the distribution of the data, such as normality, a balanced distribution of classes, or non-sparsity (a dense data set). Data generated in the real world rarely meet such strict criteria. In some cases, such as digital advertising, the generated data matrix is very sparse and follows no distinct distribution. For this reason, we propose a new approach to feature selection for data sets that do not satisfy the above assumptions. Our methodology also addresses the problem of skewness in the data. The efficiency and effectiveness of our methods are then demonstrated by comparison with other well-known statistical techniques such as ANOVA, mutual information, KL divergence, the Fisher score, Bayes' error, and chi-square. The data set used for validation is a real-world user browsing-history data set used for ad-campaign targeting; it is both very high-dimensional and highly sparse. Our approach reduces the number of features significantly without compromising the accuracy of the final predictions.
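To make the filter-method baseline concrete: the chi-square scoring that the abstract compares against can be computed on sparse binary data without ever densifying the matrix, by counting feature occurrences per class from the non-zero entries alone. The sketch below is an illustration of that general technique, not the paper's own method; all function names and the set-of-indices sparse representation are hypothetical choices for this example.

```python
from collections import defaultdict

def chi2_scores(rows, labels, n_features):
    """Chi-square score per binary feature, computed from a sparse matrix.

    rows: one set of active (non-zero) feature indices per sample,
          i.e. a sparse binary matrix stored row-wise.
    labels: binary class label (0 or 1) per sample.

    Only non-zero entries are touched, so cost scales with the number
    of non-zeros rather than with n_samples * n_features.
    """
    n = len(rows)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_count = defaultdict(int)  # feature present AND class 1
    neg_count = defaultdict(int)  # feature present AND class 0
    for feats, y in zip(rows, labels):
        counts = pos_count if y == 1 else neg_count
        for j in feats:
            counts[j] += 1
    scores = {}
    for j in range(n_features):
        # 2x2 contingency table for feature j vs. the class label
        a = pos_count[j]   # present, class 1
        b = neg_count[j]   # present, class 0
        c = n_pos - a      # absent,  class 1
        d = n_neg - b      # absent,  class 0
        num = n * (a * d - b * c) ** 2
        den = (a + b) * (c + d) * (a + c) * (b + d)
        scores[j] = num / den if den else 0.0
    return scores

def select_top_k(scores, k):
    """Filter-style selection: keep the k highest-scoring features."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example: feature 0 and 1 track the class, feature 2 is uninformative.
rows = [{0, 2}, {0}, {1}, {1, 2}]
labels = [1, 1, 0, 0]
scores = chi2_scores(rows, labels, n_features=3)
kept = select_top_k(scores, k=2)  # keeps features 0 and 1
```

Note how the scoring step never materializes the zero entries, which is the property that matters on a high-dimensional, highly sparse matrix like the browsing-history data described above.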
