A Novel Approach of Neural Topic Modelling for Document Clustering


Abstract

Topic modelling is a text mining technique for discovering common topics in a collection of documents. The proposed topic modelling methodology uses artificial neural networks to improve the clustering of similar documents by modelling the probabilistic relations between topics, documents and vocabulary. Topic modelling and clustering are generally treated as unsupervised learning, whereas neural networks are typically applied to supervised learning problems; Neural Topic Modelling reformulates topic modelling as a supervised learning task by defining an objective function whose loss is to be minimized. Custom input embedding layers were designed to extract the semantic relationships between the words in the corpus, and the model outputs a topic probability distribution for each document. Documents with similar distributions are then bucketed together when a simple distance-based similarity metric, such as cosine similarity, meets a threshold value. The model was implemented in Keras with a TensorFlow backend, and the effectiveness of the clustering was validated on the IMDB Movie dataset and the News Aggregator dataset from UCI. Compared with other commonly used clustering mechanisms combined with traditional topic models, the proposed model delivered an improved Silhouette Coefficient and Davies-Bouldin Index, along with an increased data handling capacity, making the solution scalable.
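The bucketing step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the grouping procedure or the threshold, so everything here is an assumption made for illustration: the greedy strategy (compare each document with the first member of each existing bucket), the function names `cosine_similarity` and `bucket_documents`, the threshold of 0.9, and the toy topic distributions.

```python
import math
from typing import List, Sequence


def cosine_similarity(p: Sequence[float], q: Sequence[float]) -> float:
    """Cosine similarity between two topic-probability vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_p * norm_q)


def bucket_documents(
    topic_dists: Sequence[Sequence[float]], threshold: float = 0.9
) -> List[List[int]]:
    """Greedily group document indices by topic-distribution similarity.

    Hypothetical procedure (not taken from the paper): a document joins the
    first bucket whose representative (its first member) has cosine
    similarity >= threshold; otherwise it starts a new bucket.
    """
    buckets: List[List[int]] = []
    for idx, dist in enumerate(topic_dists):
        for bucket in buckets:
            representative = topic_dists[bucket[0]]
            if cosine_similarity(dist, representative) >= threshold:
                bucket.append(idx)
                break
        else:
            # No existing bucket was similar enough.
            buckets.append([idx])
    return buckets


# Toy per-document topic distributions over three topics (made-up values).
doc_topics = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
]
buckets = bucket_documents(doc_topics, threshold=0.9)
print(buckets)
```

In this toy input the first two documents are dominated by the first topic and the last two by the third, so they fall into two buckets. In practice the threshold would need tuning per corpus, and the result of a greedy pass depends on document order.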

Bibliographic details

Source
IEEE Symposium Series on Computational Intelligence, 2018, pp. 2169-2173 (5 pages)
Authors
Sandhya Subramani; Vaishnavi Sridhar; Kaushal Shetty
Original format
PDF
Keywords
Computational modeling; Data models; Neural networks; Computer architecture; Probabilistic logic; Task analysis; Linear programming

Similar literature

1. Unsupervised neural networks for automatic Arabic text summarization using document clustering and topic modeling [J]. Alami Nabil, Meknassi Mohammed, En-nahnahi Noureddine. Expert Systems with Applications, 2021, Jun. issue
2. iVisClustering: An Interactive Visual Document Clustering via Topic Modeling [J]. Hanseung Lee, Jaeyeon Kihm, Jaegul Choo. Computer Graphics Forum: Journal of the European Association for Computer Graphics, 2012, No. 3, Pt. 3
3. Extracting topic-sensitive content from textual documents: A hybrid topic model approach [J]. Yan Liang, Ying Liu, Chong Chen. Engineering Applications of Artificial Intelligence, 2018, Apr. issue
4. A Novel Approach of Neural Topic Modelling for Document Clustering [C]. Sandhya Subramani, Vaishnavi Sridhar, Kaushal Shetty. IEEE Symposium Series on Computational Intelligence, 2018
5. Multi-document Summarization Based on Document Clustering and Neural Sentence Fusion [D]. Fuad, Tanvir Ahmed. 2018
6. Incorporating Statistical Topic Models in the Retrieval of Healthcare Documents [O]. Karla Caballero, Ram Akella. 2015
7. A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters [O]. Silva Joaquim, Mexia Joao, Coelho Carlos A. 2004
