A Study of the Influence of Different Word Segmentation Methods on Text Classification Performance under the LDA Model

Li Xiangdong (李湘东); Gao Fan (高凡); Ding Cong (丁丛)


Abstract

This paper defines three indicators, category clustering density, category complexity, and category clarity, and uses them to study, from the perspective of corpus information measurement, how three representative Chinese word segmentation methods (ICTCLAS, IK Analyzer, and 2-gram segmentation) affect text classification performance under the latent probabilistic topic model LDA. It analyzes, both quantitatively and qualitatively, how suitable each segmentation method is for classifying corpora of different text types, such as Web pages and academic literature, and why classification performance differs. The results show that the three indicators effectively characterize the effect a segmentation method has on a corpus during classification: IK Analyzer is influenced mainly by category complexity and ICTCLAS mainly by category clustering density, whereas 2-gram segmentation is influenced to a similar degree by all three indicators, which makes it adapt well to different corpora. For academic-literature corpora, 2-gram segmentation classifies well, with F1 values all above 80%; Web-page corpora are more adaptable to the various segmentation methods. The paper attempts to select the segmentation method that best improves classification performance for a given corpus by measuring the corpus rather than by experiment alone, with the aim of providing a reference for choosing a suitable Chinese word segmentation method in LDA-based classification systems for different text types such as Web pages and academic literature.
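As a rough illustration of the 2-gram segmentation mentioned in the abstract, the sketch below splits Chinese text into overlapping character bigrams. The abstract does not describe the exact variant the authors used, so the overlapping scheme, the handling of punctuation, and the function name `bigram_segment` are all assumptions made for this example.

```python
from typing import List


def bigram_segment(text: str) -> List[str]:
    """Split text into overlapping character bigrams.

    Assumption: punctuation and whitespace act as boundaries, so no
    bigram spans them. A run of a single character is kept as-is.
    """
    # Collect maximal runs of alphanumeric characters.
    # str.isalnum() is True for CJK ideographs and False for
    # punctuation such as the full-width comma.
    runs: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isalnum():
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))

    # Slide a window of width 2 over each run.
    tokens: List[str] = []
    for run in runs:
        if len(run) == 1:
            tokens.append(run)
        else:
            tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


if __name__ == "__main__":
    print(bigram_segment("文本分类"))
```

Unlike dictionary-based segmenters such as ICTCLAS or IK Analyzer, this approach needs no lexicon, which may be one reason a method of this kind behaves consistently across different corpora.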

Bibliographic details

Source
Application Research of Computers (《计算机应用研究》) | 2017, Issue 1 | pp. 62-66 | 5 pages
Authors
Li Xiangdong (李湘东); Gao Fan (高凡); Ding Cong (丁丛)
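Author affiliations

School of Information Management, Wuhan University, Wuhan 430072;
Center for Studies of Information Resources, Wuhan University, Wuhan 430072;
School of Information Management, Wuhan University, Wuhan 430072;
School of Information Management, Wuhan University, Wuhan 430072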

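Original format: PDF
Language of text: Chinese
CLC classification: Text information processing
Keywords
text classification; LDA topic model; corpus measurement; word segmentation methods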

