
Text Mining with Constrained Tensor Decomposition



Abstract

Text mining, a special case of data mining, refers to estimating, from a collection of observed documents, the knowledge or parameters needed for a given task such as unsupervised clustering. In this context, the topic of a document can be seen as a hidden variable, and the words are multi-view variables related to each other through the topic. The main goal of this paper is to estimate the probabilities of the topics and the conditional probabilities of words given topics. To this end, we use a nonnegative Canonical Polyadic (CP) decomposition of a third-order moment tensor of the observed words. Our computer simulations show that the proposed algorithm performs better than a previously proposed algorithm, which applies the robust tensor power method after whitening with the second-order moment. Moreover, because our cost function includes a nonnegativity constraint on the estimated probabilities, we never obtain negative estimates, whereas negative values often arise with the power method combined with deflation. In addition, unlike deflation-based techniques, our algorithm can handle over-complete cases, where the number of hidden variables is larger than the number of multi-view variables. Further, the proposed method supports a higher degree of over-completeness than modified versions of the tensor power method that have been customized to handle the over-complete case.
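The idea of reading topic and word probabilities off a nonnegative CP decomposition can be illustrated with a minimal sketch. It assumes a single-topic model in which three words of a document are conditionally independent given the topic, so that the third-order moment tensor is a weighted sum of rank-one terms built from the word-given-topic distributions. The multiplicative-update solver below is a generic nonnegative CP routine chosen for brevity, not the algorithm proposed in the paper, and all names (`nonneg_cp`, `khatri_rao`, `unfold`, `rel_error`, the sizes `V` and `R`) are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product; result has shape (J*K, R)."""
    J, R = B.shape
    K, _ = C.shape
    return (B[:, None, :] * C[None, :, :]).reshape(J * K, R)

def unfold(T, mode):
    """Mode-n unfolding; the remaining axes are flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def reconstruct(factors):
    """Rebuild the third-order tensor from its three factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)

def rel_error(T, factors):
    """Relative Frobenius error of the CP model."""
    return np.linalg.norm(T - reconstruct(factors)) / np.linalg.norm(T)

def nonneg_cp(T, rank, n_iter=500, eps=1e-12, seed=0):
    """Nonnegative CP via multiplicative updates (generic sketch).

    Factors start nonnegative and are only ever multiplied by
    nonnegative ratios, so they can never become negative.
    """
    rng = np.random.default_rng(seed)
    factors = [rng.random((dim, rank)) for dim in T.shape]
    for _ in range(n_iter):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            kr = khatri_rao(others[0], others[1])
            numer = unfold(T, mode) @ kr
            gram = (others[0].T @ others[0]) * (others[1].T @ others[1])
            denom = factors[mode] @ gram + eps
            factors[mode] *= numer / denom
    return factors

# Synthetic ground truth (assumed sizes): V words, R topics.
V, R = 8, 3
rng = np.random.default_rng(1)
w_true = rng.dirichlet(np.ones(R))             # topic probabilities
P_true = rng.dirichlet(np.ones(V), size=R).T   # columns: P(word | topic)

# Population third-order moment under the assumed single-topic model.
T = np.einsum("r,ir,jr,kr->ijk", w_true, P_true, P_true, P_true)

factors = nonneg_cp(T, rank=R)

# Normalize columns to probability vectors; the product of the column
# sums plays the role of the estimated topic weights.
col_sums = [F.sum(axis=0) for F in factors]
P_est = factors[0] / col_sums[0]
w_est = col_sums[0] * col_sums[1] * col_sums[2]

print("relative fit error:", rel_error(T, factors))
print("estimated topic weights (up to permutation):", w_est)
```

In practice the moment tensor would be estimated from word co-occurrence counts rather than built from known parameters, and the recovered topics are identifiable only up to a permutation of the columns.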
