Journal: International Journal of Pattern Recognition and Artificial Intelligence

Text Classification Using Compression-Based Dissimilarity Measures

Abstract

Arguably, the most difficult task in text classification is choosing an appropriate set of features that allows machine learning algorithms to provide accurate classification. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with any feature design or engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they do not require any text pre-processing or feature engineering.
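The abstract does not say which dissimilarity measure or compressor the paper uses, so the following is only a minimal sketch of the general idea, under stated assumptions: it uses the normalized compression distance (NCD), a well-known compression-based dissimilarity, with Python's standard `zlib` compressor. The function names (`ncd`, `dissimilarity_vector`, `nearest_neighbor_label`) and the choice of prototypes are hypothetical and for illustration only. Each text is mapped to a vector of its dissimilarities to a set of prototype texts, and a plain 1-nearest-neighbor rule is applied in that space.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. This is one possible
    compression-based dissimilarity, not necessarily the paper's.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def dissimilarity_vector(text: str, prototypes: list[str]) -> list[float]:
    """Represent a text by its dissimilarities to each prototype text."""
    return [ncd(text, p) for p in prototypes]


def nearest_neighbor_label(
    text: str,
    train_texts: list[str],
    train_labels: list[str],
    prototypes: list[str],
) -> str:
    """1-NN classification in the dissimilarity space (squared Euclidean distance)."""
    query = dissimilarity_vector(text, prototypes)
    best_label, best_dist = train_labels[0], float("inf")
    for train_text, label in zip(train_texts, train_labels):
        vec = dissimilarity_vector(train_text, prototypes)
        dist = sum((a - b) ** 2 for a, b in zip(query, vec))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label
```

In this sketch no tokenization, stemming, or feature selection is performed: the raw text goes straight into the compressor. A simple way to try it is to use the training texts themselves as prototypes, for example `nearest_neighbor_label(review, train_texts, train_labels, prototypes=train_texts)`. Any classical classifier, such as a support vector machine, could be trained on the same vectors instead of the 1-NN rule.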
