Text Classification Using Compression-Based Dissimilarity Measures

Coutinho David Pereira; Figueiredo Mario A. T.

Abstract

Arguably, the most difficult task in text classification is choosing a set of features that allows machine learning algorithms to classify accurately. Most state-of-the-art techniques for this task involve careful feature engineering and a pre-processing stage, which may be too expensive in the emerging context of massive collections of electronic texts. In this paper, we propose efficient methods for text classification based on information-theoretic dissimilarity measures, which are used to define dissimilarity-based representations. These methods dispense with feature design and engineering by mapping texts into a feature space using universal dissimilarity measures; in this space, classical classifiers (e.g., nearest neighbor or support vector machines) can then be used. The reported experimental evaluation of the proposed methods, on sentiment polarity analysis and authorship attribution problems, shows that they approach, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they require no text pre-processing or feature engineering.
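
The pipeline described in the abstract (map each text to a vector of dissimilarities, then apply a classical classifier) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps in the normalized compression distance computed with zlib for the paper's Ziv-Merhav cross-parsing measure, it assumes that every training text also serves as a prototype, and the names compressed_size, ncd, represent and classify_1nn are hypothetical.

```python
import math
import zlib


def compressed_size(text):
    """Number of bytes zlib needs to encode the UTF-8 form of `text`."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x, y):
    """Normalized compression distance (a stand-in for the paper's measure).

    Small when x and y share structure, closer to 1 when they do not.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def represent(text, prototypes):
    """Map a text to its vector of dissimilarities to each prototype text."""
    return [ncd(text, p) for p in prototypes]


def classify_1nn(text, train_texts, train_labels):
    """Return the label of the nearest training text in dissimilarity space."""
    # Assumption: the training texts themselves are used as prototypes.
    train_vectors = [represent(t, train_texts) for t in train_texts]
    query = represent(text, train_texts)
    # Euclidean distance between dissimilarity vectors; ties go to the first index.
    nearest = min(
        range(len(train_texts)),
        key=lambda i: math.dist(query, train_vectors[i]),
    )
    return train_labels[nearest]


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    texts = [
        "the movie was wonderful and moving " * 20,
        "terrible plot and awful acting throughout " * 20,
    ]
    labels = ["positive", "negative"]
    print(classify_1nn("a wonderful and moving movie " * 20, texts, labels))
```

The representation is built from raw compressed sizes only, so no tokenization or feature engineering is involved. With very short texts the fixed overhead of zlib dominates the compressed size and the distances become unreliable, which is one reason to treat this as a sketch and not as a substitute for the measures evaluated in the paper.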

Bibliographic record

Source
International Journal of Pattern Recognition and Artificial Intelligence, 2015, issue 5, pp. 1553004.1-1553004.19 (19 pages)
Authors
Coutinho David Pereira; Figueiredo Mario A. T.
Affiliations

Inst Politecn Lisboa, Inst Telecommun, P-1959007 Lisbon, Portugal; Inst Politecn Lisboa, Inst Super Engn Lisboa, P-1959007 Lisbon, Portugal

Univ Lisbon, Inst Telecommun, P-1049001 Lisbon, Portugal; Univ Lisbon, Inst Super Tecn, P-1049001 Lisbon, Portugal

Indexed in: Science Citation Index (SCI); Engineering Index (EI)
Original format: PDF
Language: English
Keywords
Text classification; text similarity measures; relative entropy; Ziv-Merhav method; cross-parsing algorithm

Similar literature

1. Phylogenetic tree building using a novel compression-based non-symmetric dissimilarity measure [J]. R. Busa-Fekete, A. Kocsor, Cs. Bagyinka. Applied Ecology and Environmental Research, 2006, issue 2.
2. Application of compression-based distance measures to protein sequence classification: a methodological study [J]. Kocsor A, Kertesz-Farkas A, Kajan L, et al. Bioinformatics, 2006, issue 4.
3. A compression-based dissimilarity measure for multi-task clustering [C]. Nguyen Huy Thach, Hao Shao, Bin Tong, et al. Foundations of Intelligent Systems, 2011.
4. Optimized dictionary design and classification using the matching pursuits dissimilarity measure [D]. Mazhar, Raazia. 2009.
5. Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment [O]. Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, et al. 2007.
6. Phylogenetic tree building using a novel compression-based non-symmetric dissimilarity measure [O]. R. Busa-Fekete, A. Kocsor, Cs. Bagyinka. 2008.
