Source: 《四川大学学报(工程科学版)》 (Journal of Sichuan University, Engineering Science Edition)

Chinese-English Cross-Language Topic Discovery Based on the ICE-LDA Model

Abstract

With the rapid development of the Internet against a background of globalization, mining cross-language network data has become an active research area in public opinion analysis. Detecting hot topics in Chinese and English text effectively and in a timely manner is crucial to tracking how public opinion develops. Internet news is an important part of online public opinion and, with the widespread adoption of the Internet, has become a major source of information for netizens. First, Chinese and English Internet news was collected as the data source, and the ICE-LDA model, an improvement on the LDA model, was proposed to detect co-occurrence topics across the two languages in the mixed dataset. Topics produced by the model were represented as vectors, and the distance between two topics was measured by the JS distance and the cosine similarity of their topic-text distributions. Second, a comparable parallel corpus and a non-comparable corpus were each constructed from the crawled Chinese-English mixed news data and used for topic modeling. During model building, the TF-IDF algorithm was used to extract feature words from the documents and remove meaningless noise words, improving the topic feature representation. Finally, two different topic vectorization methods were used to detect cross-language co-occurrence topics. Experimental results on the real dataset collected by the crawler designed in this work showed that the improved topic model can detect cross-language co-occurrence topics in the comparable corpus without requiring prior topic pairs, and can also detect co-occurrence topics when the corpus is imbalanced.
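The topic comparison step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes each topic is already a probability vector over a shared index set, takes "JS distance" to mean the square root of the Jensen-Shannon divergence with base-2 logarithms, and uses a hypothetical helper `match_topics` with hypothetical thresholds `max_js` and `min_cos`.

```python
import math


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence (base-2 logs).

    Assumption: p and q are probability vectors of equal length.
    The result lies in [0, 1]; 0 means identical distributions.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return math.sqrt(max(0.0, divergence))  # guard against tiny negative rounding


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def match_topics(zh_topics, en_topics, max_js=0.3, min_cos=0.8):
    """Hypothetical helper: pair topics that are close under both measures.

    The thresholds are illustrative values, not taken from the paper.
    """
    pairs = []
    for i, p in enumerate(zh_topics):
        for j, q in enumerate(en_topics):
            if js_distance(p, q) <= max_js and cosine_similarity(p, q) >= min_cos:
                pairs.append((i, j))
    return pairs


# Made-up topic vectors, for illustration only.
zh_topics = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
en_topics = [[0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
print(match_topics(zh_topics, en_topics))
```

Requiring both a small JS distance and a high cosine similarity is one possible way to combine the two measures; the abstract does not say how the paper combines them.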
