Information Sciences: An International Journal

Effectively classifying short texts by structured sparse representation with dictionary filtering


Abstract

Short text classification (STC) has attracted increasing interest with the rapid growth of Web and social media data in short text form. It is more challenging than traditional text classification (TC) because short texts yield sparse features, so state-of-the-art TC approaches perform poorly when applied to them directly. Existing STC approaches address the sparsity problem mainly by enriching text content with external corpora or additional information. Although this can improve performance, the gain depends heavily on the amount and quality of that external or additional information. Worse, such information is not always available, and acquiring it can be costly. In this paper, we introduce a structured sparse representation classifier to classify short texts effectively, and we develop an approach called convex hull vertices selection that reduces data correlation and redundancy in the dictionary (the set of training texts), which substantially boosts STC efficiency and performance. To the best of our knowledge, this is the first work that exploits structured sparsity for STC. Experiments on five datasets show that the proposed approach outperforms state-of-the-art TC methods in classification effectiveness, and outperforms the traditional sparse representation (SR) classifier in both classification effectiveness and classification efficiency. Furthermore, we carry out an experiment on short texts expanded with additional content, which indirectly shows that our approach performs better than existing STC methods that exploit external text sources. (C) 2015 Elsevier Inc. All rights reserved.
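
For readers unfamiliar with the traditional SR classifier mentioned above, the following is a minimal sketch of that baseline idea only: code a test text as a sparse combination of the training texts (the dictionary), then pick the class whose training texts reconstruct it with the smallest residual. It is not the structured classifier or the convex hull vertices selection step proposed in the paper, whose details the abstract does not give. The function names `ista_lasso` and `src_predict`, the toy term vectors, and the choice of an l1 penalty solved by iterative soft-thresholding are all assumptions made for illustration.

```python
import numpy as np

def ista_lasso(D, y, lam=0.05, n_iter=500):
    """Minimise 0.5*||y - D x||^2 + lam*||x||_1 by iterative soft-thresholding."""
    # Step size 1/L, where L is the Lipschitz constant of the smooth term's gradient.
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)
    x = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = x - step * (D.T @ (D @ x - y))                          # gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)   # soft-threshold
    return x

def src_predict(D, labels, y, lam=0.05):
    """Return the class whose dictionary atoms reconstruct y with the smallest residual."""
    x = ista_lasso(D, y, lam)
    classes = np.unique(labels)
    residuals = [
        np.linalg.norm(y - D[:, labels == c] @ x[labels == c]) for c in classes
    ]
    return classes[int(np.argmin(residuals))]

# Hypothetical toy dictionary: 6 vocabulary terms (rows), 6 training texts (columns).
# Class 0 texts use terms 0-2, class 1 texts use terms 3-5.
D = np.array([
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
], dtype=float)
D /= np.linalg.norm(D, axis=0)            # unit-norm columns, as is usual for SR coding
labels = np.array([0, 0, 0, 1, 1, 1])

# Two unseen query texts, one per class, also normalised to unit length.
query_a = np.array([1, 1, 1, 0, 0, 0], dtype=float) / np.sqrt(3)
query_b = np.array([0, 0, 0, 1, 1, 1], dtype=float) / np.sqrt(3)

pred_a = src_predict(D, labels, query_a)
pred_b = src_predict(D, labels, query_b)
```

In this sketch every training text is a dictionary column, so coding cost grows with the training set. That is the efficiency concern the abstract says dictionary filtering is meant to reduce.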
