Conference: Multi-disciplinary International Workshop on Artificial Intelligence

BoWT: A Hybrid Text Representation Model for Improving Text Categorization Based on AdaBoost.MH


Abstract

Text representation is a fundamental task in any text categorization system. Bag-of-Words (BoW) is a typical model that represents texts as vectors of single words. Although it is a simple representation model, BoW has been criticized for disregarding the relationships between words. Alternatively, the Latent Dirichlet Allocation (LDA) topic model has been proposed to represent texts as a Bag-of-Topics (BoT). In LDA, the words in the corpus are statistically grouped into a small number of themes called "latent topics", which capture the semantic relationships between the words. Thus, representing documents using BoT will dramatically reduce training time as well as improve classification performance. However, BoT has been shown not to be effective for unbalanced datasets. Accordingly, this paper presents a hybrid text representation model, BoWT, that combines BoW and BoT. In BoWT, the highly weighted BoW features are merged with the BoT features to produce a new feature space. The proposed BoWT representation is evaluated for multi-label text categorization using the well-known boosting algorithm AdaBoost.MH. Experimental results on four benchmarks demonstrate that BoWT notably outperforms both BoW and BoT and dramatically improves the classification performance of AdaBoost.MH for text categorization.
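The merging step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which weighting scheme or selection rule the paper uses, so TF-IDF weighting, corpus-wide top-k selection, and the function names below are all assumptions made for illustration. The per-document topic proportions are hand-written placeholders standing in for the output of a fitted LDA model, not real results.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {word: tf-idf weight} dict per tokenized document.

    TF-IDF is an assumed weighting scheme; the abstract does not name one.
    """
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        term_freq = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in term_freq.items()
        })
    return weights


def top_words(weights, k):
    """Pick the k words with the highest total weight across the corpus."""
    totals = {}
    for doc_weights in weights:
        for word, value in doc_weights.items():
            totals[word] = totals.get(word, 0.0) + value
    # Sort by descending weight, then alphabetically so ties are deterministic.
    ranked = sorted(totals, key=lambda word: (-totals[word], word))
    return ranked[:k]


def bowt_vectors(docs, topic_props, k):
    """Concatenate the top-k BoW features with the BoT (topic) features."""
    weights = tfidf_weights(docs)
    vocab = top_words(weights, k)
    vectors = [
        [doc_weights.get(word, 0.0) for word in vocab] + list(theta)
        for doc_weights, theta in zip(weights, topic_props)
    ]
    return vocab, vectors


# Toy corpus (hypothetical data, already tokenized).
docs = [
    ["stock", "market", "rises", "stock"],
    ["market", "falls", "today"],
    ["team", "wins", "match", "today"],
]
# Placeholder topic proportions; in practice these would come from LDA.
topic_props = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]

vocab, vectors = bowt_vectors(docs, topic_props, k=2)
for vector in vectors:
    print(vector)
```

Each resulting vector has k word features followed by one feature per topic, so the combined space stays far smaller than the full vocabulary while still keeping the most discriminative individual words. Such vectors could then be passed to any multi-label learner; the paper uses AdaBoost.MH.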
