Conference: International Conference on Algorithms and Architectures for Parallel Processing

Enriching Document Representation with the Deviations of Word Co-occurrence Frequencies

Abstract

Recent strategies reveal the semantic relatedness between documents by enriching each document with the relatedness of every word in the document collection to that document. When this relatedness is restricted to the expected frequency with which each word occurs in the document, the traditional weighted sum of word vectors is proven to give upper bounds on the expected frequencies. Summing the word vectors usually introduces duplicate counts, which weaken the discriminative power of the enriched document vectors. A strategy that gives lower bounds on the expected frequencies is also obtained by keeping the maximum value of the word vectors on each dimension. Using the lower bounds together with the deviations of word co-occurrence frequencies, a novel method is proposed to remove the duplicate counts present in the upper bounds. As a result, the proposed method smooths the generated document vectors better than the weighted sum strategy. Extensive experiments verify that document clustering with the proposed method achieves a significant performance improvement over the existing strategies.
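The two bounds named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the toy vocabulary, the co-occurrence values, the use of raw term frequencies as weights, and the function names `upper_bound` and `lower_bound` are all assumptions made for illustration. The deduplication step based on co-occurrence deviations is not specified in the abstract and is not reproduced here.

```python
# Hypothetical toy data: each word vector holds non-negative relatedness
# scores of that word to every vocabulary entry (cat, dog, pet).
word_vectors = {
    "cat": [1.0, 0.25, 0.5],
    "dog": [0.25, 1.0, 0.75],
    "pet": [0.5, 0.75, 1.0],
}

# Assumed weighting: raw term frequencies of the words in one document.
doc_tf = {"cat": 2, "dog": 1}


def upper_bound(doc_tf, word_vectors, dim):
    """Weighted sum of word vectors (the upper-bound strategy)."""
    out = [0.0] * dim
    for word, tf in doc_tf.items():
        for i, value in enumerate(word_vectors[word]):
            out[i] += tf * value
    return out


def lower_bound(doc_tf, word_vectors, dim):
    """Per-dimension maximum of the weighted word vectors (the lower-bound strategy)."""
    out = [0.0] * dim
    for word, tf in doc_tf.items():
        for i, value in enumerate(word_vectors[word]):
            out[i] = max(out[i], tf * value)
    return out


upper = upper_bound(doc_tf, word_vectors, 3)
lower = lower_bound(doc_tf, word_vectors, 3)
print("upper:", upper)
print("lower:", lower)
```

With non-negative entries, the per-dimension maximum never exceeds the sum, so the gap between the two vectors on each dimension shows how much of the sum could come from overlapping (duplicate) counts.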