Pacific-Asia Conference on Knowledge Discovery and Data Mining

Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach



Abstract

Current state-of-the-art nonparametric Bayesian text clustering methods model documents through multinomial distributions over bags of words. Although these methods can effectively exploit the word-burstiness representation of documents and achieve decent performance, they ignore the sequential information in text and the relationships among synonyms. In this paper, documents are modeled jointly by bags of words, sequential features, and word embeddings. We propose the Sequential Embedding induced Dirichlet Process Mixture Model (SiDPMM) to effectively exploit this joint document representation in text clustering. The sequential features are extracted by an encoder-decoder component, and word embeddings produced by the continuous-bag-of-words (CBOW) model are introduced to handle synonyms. Experimental results demonstrate the benefits of our model in two major respects: (1) improved performance across multiple diverse text datasets in terms of normalized mutual information (NMI); (2) more accurate inference of the ground-truth number of clusters, with a regularization effect that suppresses tiny outlier clusters.
