Conference: IEEE Symposium Series on Computational Intelligence

A Novel Approach of Neural Topic Modelling for Document Clustering

Abstract

Topic modelling is a text mining technique for discovering common topics in a collection of documents. The proposed topic modelling methodology used artificial neural networks to improve the clustering of similar documents by modelling probabilistic relations between topics, documents and vocabulary. Topic modelling and clustering are generally considered unsupervised learning tasks, whereas neural networks are typically applied to supervised learning problems; Neural Topic Modelling reformulated topic modelling as a supervised learning task by defining an objective function whose loss had to be minimized. Custom input embedding layers were designed to extract the semantic relationships between the words in the corpus, and the model output a topic probability distribution for each document. Documents with similar distributions were then bucketed together when they met a threshold on a simple distance-based similarity metric, such as cosine similarity. The model was implemented in Keras with a TensorFlow backend, and the effectiveness of the clustering was validated on the IMDB Movie dataset and the News Aggregator dataset from UCI. Compared with other commonly used clustering mechanisms combined with traditional topic models, the proposed model delivered an improved Silhouette Coefficient score and Davies-Bouldin Index, along with an increased data handling capacity, thereby making the solution scalable.
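The bucketing step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact grouping procedure, so the greedy strategy below (compare each document with the first member of every existing bucket), the function names `cosine_similarity` and `bucket_documents`, the threshold of 0.9, and the toy topic distributions are all assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_similarity(p, q):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_p * norm_q)


def bucket_documents(topic_dists, threshold=0.9):
    """Greedily group documents by the similarity of their topic distributions.

    Each document joins the first bucket whose representative (its first
    member) has cosine similarity >= threshold; otherwise it opens a new
    bucket. This greedy rule is an assumption, not the paper's procedure.
    """
    buckets = []  # each bucket is a list of document indices
    for i, dist in enumerate(topic_dists):
        for bucket in buckets:
            representative = topic_dists[bucket[0]]
            if cosine_similarity(dist, representative) >= threshold:
                bucket.append(i)
                break
        else:
            buckets.append([i])
    return buckets


# Hypothetical per-document topic distributions over three topics.
topic_dists = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]

print(bucket_documents(topic_dists, threshold=0.9))
```

With these toy values, the first two documents fall into one bucket and the last two into another. In practice the threshold would need tuning per corpus, and the resulting buckets could then be scored with the Silhouette Coefficient or the Davies-Bouldin Index, as the abstract reports.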
