European Conference on Principles of Data Mining and Knowledge Discovery

Choose Your Words Carefully: An Empirical Study of Feature Selection Metrics for Text Classification

Abstract

Good feature selection is essential for text classification to make it tractable for machine learning, and to improve classification performance. This study benchmarks the performance of twelve feature selection metrics across 229 text classification problems drawn from Reuters, OHSUMED, TREC, etc., using Support Vector Machines. The results are analyzed for various objectives. For best accuracy, F-measure, or recall, the findings reveal an outstanding new feature selection metric, "Bi-Normal Separation" (BNS). For precision alone, however, Information Gain (IG) was superior. A new evaluation methodology is offered that focuses on the needs of the data mining practitioner who seeks to choose one or two metrics to try that are most likely to have the best performance for the single dataset at hand. This analysis determined, for example, that IG and Chi-Squared have correlated failures for precision, and that IG paired with BNS is a better choice.
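To make the two headline metrics concrete, below is a minimal Python sketch of Bi-Normal Separation and Information Gain for a single binary word feature and a binary class, assuming the standard definitions: BNS = |F⁻¹(tpr) − F⁻¹(fpr)| with F⁻¹ the inverse standard-normal CDF and the rates clipped away from 0 and 1, and IG = H(class) − H(class | word). The counts and function names are illustrative only and are not taken from the paper's code.

```python
"""Sketch of two feature selection metrics for text classification:
Bi-Normal Separation (BNS) and Information Gain (IG), scored for one
word feature against one binary class."""
from math import log2

from scipy.stats import norm


def _entropy(counts):
    """Shannon entropy (bits) of a distribution given by raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def bns(tp, fp, pos, neg, eps=0.0005):
    """Bi-Normal Separation: |F^-1(tpr) - F^-1(fpr)|.

    tp/fp: positive/negative documents containing the word;
    pos/neg: total positive/negative documents.
    Rates are clipped to (eps, 1 - eps) so the inverse normal CDF stays finite.
    """
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))


def info_gain(tp, fp, pos, neg):
    """Information Gain: H(class) - H(class | word present/absent)."""
    fn, tn = pos - tp, neg - fp
    total = pos + neg
    p_word = (tp + fp) / total
    h_class = _entropy([pos, neg])
    h_cond = p_word * _entropy([tp, fp]) + (1 - p_word) * _entropy([fn, tn])
    return h_class - h_cond


if __name__ == "__main__":
    # Toy counts: a word occurring in 80 of 100 positive and 10 of 900 negative docs.
    print(f"BNS = {bns(80, 10, 100, 900):.3f}")
    print(f"IG  = {info_gain(80, 10, 100, 900):.3f}")
```

Either score can then be computed for every candidate word and the top-k words kept as the feature set; the study's comparison concerns which of these rankings yields the better classifier.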
