European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2005), October 3–7, 2005, Porto, Portugal

Weighted Average Pointwise Mutual Information for Feature Selection in Text Categorization



Abstract

Mutual information is a common feature score for feature selection in text categorization. It has two theoretical drawbacks: it assumes independent word variables, and it gives longer documents higher weight when estimating feature scores, in contrast to common evaluation measures, which do not distinguish between long and short documents. We propose a variant of mutual information, called Weighted Average Pointwise Mutual Information (WAPMI), that avoids both problems. We provide theoretical as well as extensive empirical evidence in favor of WAPMI. Furthermore, we show that WAPMI has a useful property that other feature-scoring metrics lack: it allows the best feature-set size to be selected automatically by maximizing an objective function, which can be done with a simple heuristic, without resorting to costly methods such as EM and model selection.
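The length-weighting problem the abstract describes can be made concrete with a small sketch. Below, each document contributes its *relative* word frequencies p(w|d) to the class-conditional estimate, so long and short documents carry equal weight; the per-word score is then an average pointwise mutual information log p(w|c)/p(w), weighted by p(c)·p(w|c). This is a minimal illustration of the idea only: the function name `wapmi_scores` and the exact weighting scheme are my simplifications, not the paper's precise WAPMI definition.

```python
import math
from collections import Counter

def wapmi_scores(docs, labels):
    """Length-normalized pointwise-MI feature scores (illustrative sketch).

    docs   -- list of token lists, one per document
    labels -- class label per document
    Returns {word: score}, where higher means more class-discriminative.
    """
    classes = sorted(set(labels))
    n_docs = len(labels)
    p_c = {c: labels.count(c) / n_docs for c in classes}

    # p(w|c): average of per-document distributions p(w|d) within class c,
    # so every document counts equally regardless of its length.
    p_w_given_c = {c: Counter() for c in classes}
    for doc, c in zip(docs, labels):
        doc_len = len(doc)
        for w, k in Counter(doc).items():
            p_w_given_c[c][w] += (k / doc_len) / labels.count(c)

    # p(w): class-prior mixture of the per-class word distributions.
    p_w = Counter()
    for c in classes:
        for w, p in p_w_given_c[c].items():
            p_w[w] += p_c[c] * p

    # score(w) = sum_c p(c) * p(w|c) * log( p(w|c) / p(w) )
    scores = {}
    for w in p_w:
        scores[w] = sum(
            p_c[c] * p_w_given_c[c][w]
            * math.log(p_w_given_c[c][w] / p_w[w])
            for c in classes
            if p_w_given_c[c][w] > 0
        )
    return scores

# Toy usage: words concentrated in one class receive higher scores
# than words spread evenly across classes.
docs = [["ball", "goal", "goal"], ["ball", "team"],
        ["vote", "law"], ["law", "court", "vote", "vote"]]
labels = ["sport", "sport", "politics", "politics"]
scores = wapmi_scores(docs, labels)
```

Note that the KL-divergence form of the score makes every value nonnegative, and a word occurring with identical conditional probability in all classes scores zero.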
