A Hybrid Statistical Data Pre-processing Approach for Language-Independent Text Classification

Abstract

Data pre-processing is an important topic in Text Classification (TC). It aims to convert the original textual data into a data-mining-ready structure in which the most significant text features, those that serve to differentiate between text categories, are identified. Broadly speaking, textual data pre-processing techniques can be divided into three groups: (ⅰ) linguistic, (ⅱ) statistical, and (ⅲ) hybrid, combining (ⅰ) and (ⅱ). With regard to language-independent TC, our study relates to the statistical aspect only. Textual data pre-processing comprises Document-base Representation (DR) and Feature Selection (FS). In this paper, we propose a hybrid statistical FS approach that integrates two existing statistical FS techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and GSSC (Galavotti·Sebastiani·Simi Coefficient). Our proposed approach is presented under a statistical "bag of phrases" DR setting. The experimental results, based on the well-established associative text classification approach, demonstrate that our proposed technique outperforms existing mechanisms with respect to classification accuracy.
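
The abstract names the two statistical scores but gives neither their formulas nor the rule used to combine them. As a rough illustration only, the sketch below computes both scores from their commonly cited definitions in the text-categorization literature: the DIA association factor as the estimate of P(c | f), and the GSS coefficient as P(f, c)·P(¬f, ¬c) − P(f, ¬c)·P(¬f, c), for a feature f and category c. The function name `score_features`, the input layout, and the toy phrases are assumptions made for this example. It does not attempt to reproduce the paper's hybrid combination, which the abstract does not describe.

```python
from collections import Counter


def score_features(docs):
    """Score every (feature, category) pair in a small document base.

    docs: list of (features, category) pairs, where features is an
    iterable of phrases. This layout is an assumption for the sketch.
    Returns {(feature, category): (diaaf, gssc)}.
    """
    n = len(docs)
    feature_count = Counter()  # documents containing the feature
    class_count = Counter()    # documents in the category
    joint_count = Counter()    # documents in the category containing the feature

    for features, category in docs:
        class_count[category] += 1
        for f in set(features):
            feature_count[f] += 1
            joint_count[(f, category)] += 1

    scores = {}
    for f, n_f in feature_count.items():
        for c, n_c in class_count.items():
            n_fc = joint_count[(f, c)]
            # DIA association factor: estimate of P(c | f)
            diaaf = n_fc / n_f
            # GSS coefficient: P(f,c)P(~f,~c) - P(f,~c)P(~f,c)
            p_fc = n_fc / n
            p_f_notc = (n_f - n_fc) / n
            p_notf_c = (n_c - n_fc) / n
            p_notf_notc = (n - n_f - n_c + n_fc) / n
            gssc = p_fc * p_notf_notc - p_f_notc * p_notf_c
            scores[(f, c)] = (diaaf, gssc)
    return scores


# Toy document base with phrase features (illustrative data only).
docs = [
    ({"stock market", "interest rate"}, "finance"),
    ({"stock market", "central bank"}, "finance"),
    ({"football match", "stock market"}, "sport"),
    ({"football match", "world cup"}, "sport"),
]
scores = score_features(docs)
print(scores[("football match", "sport")])  # (1.0, 0.25)
```

In this toy data "football match" occurs only in "sport" documents, so both scores single it out for that category, while "stock market" appears in both categories and scores lower for each. A feature selector would typically keep the top-scoring features per category; how the two scores are merged into one ranking is left open here.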

Bibliographic details

Source
Advanced Data Mining and Applications | 2009 | pp. 338-349 | 12 pages
Conference location: Beijing (CN)
Authors
Yanbo J. Wang; Frans Coenen; Robert Sanderson
Author affiliations

Information Management Center, China Minsheng Banking Corp., Ltd., Room 606, Building No. 8, 1 Zhongguancun Nandajie, 100873 Beijing, China;

Department of Computer Science, University of Liverpool, Ashton Building, Ashton Street, Liverpool, L69 3BX, UK;

Department of Computer Science, University of Liverpool, Ashton Building, Ashton Street, Liverpool, L69 3BX, UK;

Original format: PDF
Language: English
Chinese Library Classification: TP311.13
Keywords
associative classification; data pre-processing; document-base representation; feature selection; (language-independent) text classification;

