Conference: Advanced data mining and applications

A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification


Abstract

Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure in which the most significant text features, those that serve to differentiate between text categories, are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (ⅰ) linguistic, (ⅱ) statistical, and (ⅲ) hybrid, combining (ⅰ) and (ⅱ). Because our concern is language-independent TC, our study relates to the statistical aspect only. Textual data pre-processing comprises two tasks: Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing statistical FS techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti·Sebastiani·Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to classification accuracy.
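To make the two base measures concrete, the following is a minimal sketch of how DIAAF and GSSC scores might be computed from document counts. It assumes the definitions commonly given in the text categorization literature: the DIA association factor as P(c|t), and the GSS coefficient as P(t,c)·P(¬t,¬c) − P(t,¬c)·P(¬t,c). The abstract does not state how the proposed hybrid combines the two, so no combination rule is shown. The function name `feature_scores` and the toy data are hypothetical, and single tokens stand in for phrases for brevity.

```python
from collections import Counter


def feature_scores(docs, labels, category):
    """Return {feature: (diaaf, gssc)} for one target category.

    docs   -- list of sets of features (hypothetical input format)
    labels -- list of category labels, aligned with docs
    Assumed definitions (not taken from the abstract):
      DIAAF(t, c) = P(c | t)
      GSSC(t, c)  = P(t, c) * P(not t, not c) - P(t, not c) * P(not t, c)
    """
    n = len(docs)
    n_c = sum(1 for y in labels if y == category)

    df = Counter()    # number of documents containing each feature
    df_c = Counter()  # same, restricted to documents in the category
    for feats, y in zip(docs, labels):
        for t in set(feats):
            df[t] += 1
            if y == category:
                df_c[t] += 1

    scores = {}
    for t in df:
        a = df_c[t]        # feature present, document in category
        b = df[t] - a      # feature present, document not in category
        c = n_c - a        # feature absent, document in category
        d = n - a - b - c  # feature absent, document not in category
        diaaf = a / (a + b)                # a + b >= 1 for any seen feature
        gssc = (a * d - b * c) / (n * n)   # joint probabilities as counts / n
        scores[t] = (diaaf, gssc)
    return scores


if __name__ == "__main__":
    toy_docs = [{"goal", "match"}, {"goal", "team"},
                {"vote", "party"}, {"vote", "match"}]
    toy_labels = ["sport", "sport", "politics", "politics"]
    for feature, pair in sorted(feature_scores(toy_docs, toy_labels, "sport").items()):
        print(feature, pair)
```

Under these assumed definitions, a feature that appears only in the target category scores high on both measures, while one spread evenly across categories gets a GSSC of zero. A hybrid selector would rank features by some function of the two values; the specific function used in the paper is not given here.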
