Hidden Markov-based LDA Internet Sensitive Information Text Filtering

Abstract

We are in an era of rapid growth in Internet information [1], with billions of texts generated every day. Faced with this “digital torrent”, ensuring the ecological security and healthy development of the Internet has become a technical challenge. Rocchio [2] proposed a linear classifier, a classification algorithm based on the linear vector space model. As hardware has improved, machine learning models have become mainstream; the main methods are linear regression [3], K-Nearest Neighbor [4], neural network models [5] and Support Vector Machines [6]. This paper proposes a hidden Markov model based on feature keywords and themes, which is a statistical approach. We use the TextRank algorithm [7] to extract feature words from a large number of data sets. Using the Apriori algorithm [8] to quantify the implicit relationships between feature words, we generate a feature-word confidence matrix and establish an HMM-LDA correlation model. From a text document we derive an associated state matrix and a state transition probability matrix, which determine the transition probability of the visible state chain. In this way we can filter and identify sensitive Internet information text.
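The abstract does not say how the feature-word confidence matrix is computed. As a rough illustration only, the sketch below computes pairwise association-rule confidence, conf(a → b) = support({a, b}) / support({a}), which is the quantity Apriori reports for two-item rules. It is not a full Apriori frequent-itemset search and not the authors' implementation. The function name `confidence_matrix`, the toy documents and the feature words are all hypothetical.

```python
from itertools import permutations


def confidence_matrix(docs, feature_words):
    """Return conf[a][b] = support({a, b}) / support({a}) over the documents.

    Hypothetical helper: each document is a list of tokens, and support is
    counted as the number of documents that contain the word(s).
    """
    vocab = set(feature_words)
    # Keep only the feature words that occur in each document.
    doc_sets = [set(doc) & vocab for doc in docs]
    # Single-word support: number of documents containing the word.
    support = {w: sum(w in d for d in doc_sets) for w in feature_words}

    conf = {a: {b: 0.0 for b in feature_words if b != a} for a in feature_words}
    for a, b in permutations(feature_words, 2):
        if support[a] == 0:
            continue  # confidence is undefined; leave it at 0.0
        both = sum(a in d and b in d for d in doc_sets)
        conf[a][b] = both / support[a]
    return conf


# Toy data, invented for illustration.
docs = [
    ["free", "prize", "click"],
    ["free", "click"],
    ["prize", "winner"],
    ["free", "prize"],
]
words = ["free", "prize", "click"]
matrix = confidence_matrix(docs, words)
```

In this toy corpus every document containing "click" also contains "free", so `matrix["click"]["free"]` is 1.0, while `matrix["free"]["click"]` is lower because "free" also appears without "click". A matrix of this kind is asymmetric, which is one reason it could plausibly serve as input to a transition-style model such as the HMM described above.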
