Feature selection and extraction for text classification.

Abstract

An inherent property of features in the text classification domain is that they are redundant. In this domain, words are used as features, and since words overlap in meaning, the resulting features display some degree of redundancy. By selecting a feature set with lower redundancy for the classification task, the same classification performance can be obtained with fewer features.

In this thesis, a feature selector called MIFS-C, derived from the mutual information feature selection (MIFS) algorithm, is introduced. This algorithm requires an expression for the information that is added by the inclusion of a feature. This thesis provides an improvement in its formulation, such that the classification results are improved. An optimization is also presented that achieves a significant training time speedup over the original algorithm. The MIFS algorithms require an appropriate value for a redundancy parameter; however, none of the previous works suggests how to select a suitable value. An algorithm to estimate an optimal value for this parameter is presented in this thesis.

A number of feature extraction techniques that generate more complex features, such as phrases and collocations, are also investigated. However, these features add more redundancy to the feature set, so a feature selector that reduces this redundancy is required. Moreover, the overall finding is that little is gained (even with a sophisticated feature selector such as MIFS-C) by including such features in the feature set. Therefore, better results can be achieved by focusing on better feature selection (for example, by using the MIFS-C algorithm) in conjunction with word-only features than by focusing on extracting complicated features.
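To make the role of the redundancy parameter concrete, the following is a minimal sketch of the greedy criterion usually attributed to the original MIFS algorithm: at each step, pick the feature f that maximizes I(C; f) - beta * sum of I(f; s) over the already selected features s. It is not the MIFS-C formulation, which the abstract does not spell out. The function names, the toy data, and the use of plain empirical counts to estimate mutual information are all assumptions made for illustration.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def mifs_select(features, labels, k, beta):
    """Greedy MIFS-style selection (illustrative sketch, not MIFS-C).

    features: dict mapping a feature name to a list of discrete values
    labels:   list of class labels, same length as each feature list
    k:        number of features to select
    beta:     redundancy parameter; 0 ignores redundancy entirely
    """
    relevance = {f: mutual_information(v, labels) for f, v in features.items()}
    selected = []
    remaining = set(features)
    while remaining and len(selected) < k:
        def score(f):
            # Penalize information shared with features already selected
            redundancy = sum(
                mutual_information(features[f], features[s]) for s in selected
            )
            return relevance[f] - beta * redundancy

        # sorted() makes tie-breaking deterministic (first name wins)
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): the class is determined by a and b together,
# and a_copy duplicates a exactly, so it is fully redundant with a.
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 0, 1, 1, 0, 0, 1, 1]
labels = [2 * x + y for x, y in zip(a, b)]
features = {"a": a, "a_copy": list(a), "b": b}

print(mifs_select(features, labels, k=2, beta=0.0))  # ['a', 'a_copy']
print(mifs_select(features, labels, k=2, beta=1.0))  # ['a', 'b']
```

With beta set to 0 the selector keeps the redundant duplicate, while a positive beta steers it toward the complementary feature, which is the behaviour that motivates choosing the parameter carefully.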

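Bibliographic record

  • Author

    Bakus, Jan.

  • Author affiliation

    University of Waterloo (Canada).

  • Degree-granting institution University of Waterloo (Canada).
  • Discipline Engineering System Science.
  • Degree Ph.D.
  • Year 2005
  • Pages 153 p.
  • Total pages 153
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification Systems science
  • Keywords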
