Foreign-language journal: The Electronic Library

A paper-text perspective: Studies on the influence of feature granularity for Chinese short-text classification in the Big Data era



Abstract

Purpose - In the era of Big Data, network digital resources are growing rapidly, and short-text resources such as tweets, comments and messages are especially abundant. This study aims to compare the category discriminative capacity (CDC) of Chinese language fragments of different granularities, and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).

Design/methodology/approach - This study uses the discipline classification of journal articles from CSSCI as a simulation environment. After sorting out the distribution rules of classification features at various granularities (keywords, terms and characters), the classification results obtained with the SVM algorithm are compared and evaluated from three angles: using the same experiment samples, testing before and after feature optimization, and introducing external data.

Findings - The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity, the better the classification result, and vice versa. However, a low-granularity feature is also feasible: its CDC can be improved by reasonable weight setting, and it can even exceed a high-granularity feature when classification precision, computational complexity and text coverage are considered together.

Originality/value - This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords, and to demonstrate that the CDC of Chinese character features can be strengthened by using a mix of frequency and position as the weight.
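
The abstract does not give the weighting formula, so the following is only a minimal sketch of what "mixing frequency and position as weight" for character-level features might look like. The function name `char_weights`, the parameter `alpha` and the specific formula (relative frequency boosted by how early a character first appears) are all assumptions made for illustration, not the authors' method. The resulting weights could serve as a feature vector for a classifier such as an SVM.

```python
from collections import Counter


def char_weights(text, alpha=0.5):
    """Weight each character of a short text by frequency and position.

    Hypothetical formula (an assumption, not taken from the paper):
        weight = relative_frequency * (1 + alpha * earliness)
    where earliness is 1.0 for the first character and approaches 0
    for characters that first appear near the end of the text.
    """
    chars = [c for c in text if not c.isspace()]
    n = len(chars)
    if n == 0:
        return {}

    freq = Counter(chars)

    # Record the index of the first occurrence of each character.
    first_pos = {}
    for i, c in enumerate(chars):
        first_pos.setdefault(c, i)

    weights = {}
    for c, count in freq.items():
        tf = count / n
        earliness = 1.0 - first_pos[c] / n
        weights[c] = tf * (1.0 + alpha * earliness)
    return weights


if __name__ == "__main__":
    # A short title-like text; each distinct character becomes one feature.
    for char, weight in sorted(char_weights("数据分类数据").items(),
                               key=lambda kv: -kv[1]):
        print(char, round(weight, 3))
```

Characters need no word segmentation, which is one practical reason a low-granularity feature gives wide text coverage at low computational cost; setting `alpha` to 0 reduces the sketch to plain relative frequency.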
