Journal: Information Processing & Management

Extended pre-processing pipeline for text classification: On the role of meta-feature representations, sparsification and selective sampling


Abstract

Text classification pipelines are sequences of tasks performed to classify documents into a set of predefined categories. The pre-processing phase (before training) of these pipelines involves different ways of transforming and manipulating the documents for the next (learning) phase. In this paper, we introduce three new steps into the pre-processing phase of text classification pipelines to improve effectiveness while reducing the associated costs. The first step, distance-based Meta-Feature (MF) generation, aims at reducing the dimensionality of the original term-document matrix while producing a potentially more informative space that explicitly exploits discriminative labeled information. The second step is sparsification, which makes the MF representation less dense to reduce training costs and noise. The third step is selective sampling (SS), which removes rows (documents) from the matrix obtained in the previous step by carefully selecting the "best" documents for the learning phase. Our experiments show that the proposed extended pre-processing pipeline can achieve significant gains in effectiveness when compared to the original TF-IDF representation (up to 52%) and to embedding-based representations (up to 46%), at a much lower cost (up to 9.7x faster on some datasets). Other main contributions of our work include a thorough and rigorous evaluation of the trade-offs between cost and effectiveness associated with introducing these new steps into the pipeline, as well as a comprehensive comparative experimental evaluation of many alternatives in terms of representations, approaches, etc.
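
To make the order of the three steps concrete, the following is a minimal Python sketch on a toy term-document matrix. It is not the authors' implementation: the abstract does not specify the distance function, the sparsification rule, or the sampling criterion, so everything here is an assumption chosen for illustration. The sketch assumes cosine similarity to per-class centroids as the meta-features, a fixed threshold (`threshold`) for sparsification, and a smallest-margin rule (`keep_per_class`) for selective sampling. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def class_centroids(X, y):
    """Mean vector of each class, computed from the labeled rows of X."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sorted(sums)}

def meta_features(X, centroids):
    """Step 1 (assumed variant): one similarity-to-centroid feature per class."""
    return [[cosine(row, centroids[c]) for c in centroids] for row in X]

def sparsify(M, threshold):
    """Step 2 (assumed rule): zero out meta-feature values below `threshold`."""
    return [[v if v >= threshold else 0.0 for v in row] for row in M]

def selective_sample(M, y, keep_per_class):
    """Step 3 (assumed criterion): per class, keep the rows whose two largest
    meta-feature values are closest together. Returns sorted row indices."""
    by_class = {}
    for i, (row, label) in enumerate(zip(M, y)):
        top = sorted(row, reverse=True)
        margin = top[0] - (top[1] if len(top) > 1 else 0.0)
        by_class.setdefault(label, []).append((margin, i))
    kept = []
    for items in by_class.values():
        kept.extend(i for _, i in sorted(items)[:keep_per_class])
    return sorted(kept)

# Toy data: 6 documents x 4 terms, two classes (illustrative values only).
X = [
    [1, 1, 0, 0],
    [2, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 2, 1],
    [0, 1, 1, 1],
]
y = ["a", "a", "a", "b", "b", "b"]

centroids = class_centroids(X, y)
M = meta_features(X, centroids)               # 6 x 2 instead of 6 x 4
S = sparsify(M, threshold=0.2)                # weak similarities become 0.0
kept = selective_sample(S, y, keep_per_class=2)
X_train = [S[i] for i in kept]                # reduced matrix for the learner
y_train = [y[i] for i in kept]
```

In this sketch the number of columns drops from the vocabulary size to the number of classes, and the number of rows drops to `keep_per_class` per class. The centroids here are computed from the same documents they describe, which is acceptable for a toy example; a real pipeline would need to keep a document's own label from influencing its meta-features, for instance by computing them across folds.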
