Conference: International Conference on Database Systems for Advanced Applications

Joint Probability Consistent Relation Analysis for Document Representation


Abstract

Measuring the semantic similarity between documents is an important problem because it underlies many applications, such as document summarization, web search, and text analysis. Although many studies have addressed this problem by enriching document vectors based on the relatedness of the words involved, performance is still far from satisfactory because of insufficient data, i.e., sparse and anomalous co-occurrences between words. Insufficient data can only yield unreliable relatedness between words. In this paper, we propose an effective approach to correct the unreliable relatedness, which keeps the joint probability of the co-occurrence between each word and itself consistently equal to its occurrence probability throughout the generation of the relatedness. Hence, the unreliable relatedness is effectively corrected by referring to the occurrence frequencies of the words, which is confirmed theoretically and experimentally. A thorough evaluation conducted on real datasets shows that significant improvement in document clustering has been achieved compared with state-of-the-art methods.
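The consistency condition in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm, which the abstract does not specify. It assumes document-level co-occurrence counts, and the relatedness formula P(u, v) / sqrt(P(u) P(v)), the enrichment step, and all function names are hypothetical choices made for illustration. The only part taken from the abstract is that P(w, w) is kept equal to P(w).

```python
import math
from collections import Counter
from itertools import combinations


def joint_probabilities(docs):
    """Estimate document-level P(w) and P(u, v) from tokenized documents."""
    n = len(docs)
    occ, co = Counter(), Counter()
    for doc in docs:
        words = sorted(set(doc))
        occ.update(words)
        co.update(combinations(words, 2))
    p_word = {w: c / n for w, c in occ.items()}
    p_joint = {}
    for (u, v), c in co.items():
        p_joint[(u, v)] = p_joint[(v, u)] = c / n
    # Consistency condition from the abstract: P(w, w) = P(w).
    for w, p in p_word.items():
        p_joint[(w, w)] = p
    return p_word, p_joint


def relatedness(u, v, p_word, p_joint):
    """Assumed relatedness measure (illustrative only); equals 1 when u == v."""
    return p_joint.get((u, v), 0.0) / math.sqrt(p_word[u] * p_word[v])


def enrich(doc, vocab, p_word, p_joint):
    """Spread each word's count over related vocabulary words."""
    counts = Counter(doc)
    return [
        sum(c * relatedness(w, v, p_word, p_joint) for w, c in counts.items())
        for v in vocab
    ]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy corpus (made up for this example).
docs = [
    ["car", "engine"],
    ["automobile", "engine"],
    ["car", "road"],
    ["banana", "fruit"],
]
vocab = sorted({w for d in docs for w in d})
p_word, p_joint = joint_probabilities(docs)

car = enrich(["car"], vocab, p_word, p_joint)
auto = enrich(["automobile"], vocab, p_word, p_joint)
banana = enrich(["banana"], vocab, p_word, p_joint)
print(cosine(car, auto), cosine(car, banana))
```

In this toy corpus, "car" and "automobile" never appear in the same document, so their plain bag-of-words vectors are orthogonal. After enrichment they overlap through the shared neighbor "engine", while "car" and "banana" stay unrelated.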
