Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling

Washington Cunha; Sergio Canuto; Felipe Viegas; Thiago Salles; Christian Gomes; Vitor Mangaravite; Elaine Resende; Thierson Rosa; Marcos Andre Goncalves; Leonardo Rocha

Abstract

Text classification pipelines are sequences of tasks that must be performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involves different ways of transforming and manipulating the documents for the next (learning) phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The first step, distance-based Meta-Features (MFs) generation, aims at reducing the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step is sparsification, aimed at making the MF representation less dense to reduce training costs and noise. The third step is selective sampling (SS), aimed at removing rows (documents) from the matrix obtained in the previous step by carefully selecting the "best" documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness when compared to the original TF-IDF (up to 52%) and embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster on some datasets). Other main contributions of our work include a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with introducing these new steps into the pipeline, as well as a comprehensive comparative experimental evaluation of many alternatives in terms of representations, approaches, etc.
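The three steps described in the abstract can be illustrated with a toy sketch. The abstract does not specify the actual meta-features, the sparsification rule, or the selective sampling algorithm, so everything below is an assumption made for illustration: the meta-features are cosine similarities to class centroids, sparsification is a fixed threshold, and the sampling rule (keep the most ambiguous documents) is a stand-in heuristic, not the authors' method. All function names, the toy data, and the parameter values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def class_centroids(X, y):
    """Mean vector of the training documents of each class, ordered by label."""
    groups = {}
    for row, label in zip(X, y):
        groups.setdefault(label, []).append(row)
    return {c: [sum(col) / len(rows) for col in zip(*rows)]
            for c, rows in sorted(groups.items())}

def meta_features(X, centroids):
    """Step 1 (assumed form): one feature per class, the similarity of the
    document to that class centroid. Reduces |terms| columns to |classes|."""
    return [[cosine(row, cen) for cen in centroids.values()] for row in X]

def sparsify(M, threshold):
    """Step 2 (assumed rule): zero out meta-feature values below a threshold."""
    return [[v if v >= threshold else 0.0 for v in row] for row in M]

def select_samples(M, keep):
    """Step 3 (stand-in heuristic): keep the `keep` rows whose two largest
    values are closest together, i.e. the most ambiguous documents.
    Assumes at least two classes (two columns)."""
    def margin(row):
        top = sorted(row, reverse=True)
        return top[0] - top[1]
    order = sorted(range(len(M)), key=lambda i: margin(M[i]))
    return sorted(order[:keep])

# Toy term-document matrix: 4 documents x 4 terms (hypothetical weights).
X = [[1, 0, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 1, 1, 1]]
y = ["sports", "sports", "tech", "tech"]

centroids = class_centroids(X, y)
mf = meta_features(X, centroids)          # 4 documents x 2 meta-features
sparse_mf = sparsify(mf, threshold=0.25)  # weak similarities become 0.0
selected = select_samples(sparse_mf, keep=2)
print(selected)
```

In this sketch the four term columns collapse to two meta-feature columns (one per class), which mirrors the dimensionality reduction the abstract attributes to the MF step; a real pipeline would operate on sparse TF-IDF matrices and a much larger label set.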


Bibliographic details

Source
Information Processing & Management | 2020, Issue 4 | 102263.1-102263.25 | 25 pages
Authors
Washington Cunha; Sergio Canuto; Felipe Viegas; Thiago Salles; Christian Gomes; Vitor Mangaravite; Elaine Resende; Thierson Rosa; Marcos Andre Goncalves; Leonardo Rocha
Author affiliations

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Goias GO Brazil;

Federal University of Minas Gerais Belo Horizonte MG Brazil;

Federal University of Sao Joao Del Rei Sao Joao Del Rei MG Brazil;

Original format: PDF
Language: English
Keywords
Text classification pipelines; Pre-processing; Meta-features; Selective sampling; Sparsification; Experimental evaluation;


