European Conference on Principles and Practice of Knowledge Discovery in Databases

Weighted Average Pointwise Mutual Information for Feature Selection in Text Categorization



Abstract

Mutual information is a common feature score for feature selection in text categorization. It suffers from two theoretical problems: it assumes independent word variables, and it gives longer documents higher weight in the estimation of feature scores, in contrast to common evaluation measures, which do not distinguish between long and short documents. We propose a variant of mutual information, called Weighted Average Pointwise Mutual Information (WAPMI), that avoids both problems. We provide theoretical as well as extensive empirical evidence in favor of WAPMI. Furthermore, we show that WAPMI has a useful property that other feature metrics lack: it allows the best feature set size to be selected automatically by maximizing an objective function, which can be done with a simple heuristic, without resorting to costly methods such as EM or model selection.
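The abstract's key idea can be illustrated with a small sketch. The exact WAPMI definition and document weights are given in the paper; the version below is a hypothetical simplification that shows the length-normalization idea: each document's term frequencies are divided by its length before averaging, so long documents do not dominate the score, and each document's contribution is a weighted pointwise mutual information term. Function and variable names are illustrative, not from the paper.

```python
import math
from collections import Counter

def wapmi_scores(docs, labels, smoothing=1e-9):
    """Illustrative weighted-average PMI feature scores (a sketch,
    not the paper's exact formula).

    docs:   list of token lists
    labels: class label per document
    Each document contributes length-normalized term probabilities,
    so short and long documents carry equal weight -- the problem
    the abstract attributes to plain mutual information.
    """
    classes = set(labels)
    vocab = {t for d in docs for t in d}
    n_docs = len(docs)

    # P(t): corpus-level term probability, normalized per document
    p_t = Counter()
    for d in docs:
        for t, n in Counter(d).items():
            p_t[t] += (n / len(d)) / n_docs

    scores = {t: 0.0 for t in vocab}
    for c in classes:
        idx = [i for i, y in enumerate(labels) if y == c]
        # P(t|c): class-conditional term probability, same normalization
        p_tc = Counter()
        for i in idx:
            for t, n in Counter(docs[i]).items():
                p_tc[t] += (n / len(docs[i])) / len(idx)
        # average the pointwise MI term log(P(t|c)/P(t)) over the class,
        # weighted by the class prior (one simple choice of weights)
        prior = len(idx) / n_docs
        for t in vocab:
            if p_tc[t] > 0:
                scores[t] += prior * p_tc[t] * math.log(
                    (p_tc[t] + smoothing) / (p_t[t] + smoothing))
    return scores
```

With this scoring, a word concentrated in one class (e.g. "spam" in spam documents) receives a higher score than a word spread evenly across classes, whose pointwise MI terms are near zero.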
