Conference: International Symposium on Knowledge and Systems Sciences

Improving Precision of Inter-Document Similarity Measure by Clustering SVD



Abstract

Text representation, a fundamental step in intelligent text processing, is the process of determining index terms for documents and converting the documents into numeric vectors over those terms. LSI (Latent Semantic Indexing) based on SVD (Singular Value Decomposition) was proposed to overcome the problems of polysemy and homonymy in traditional lexical matching. However, although its representative quality has been validated, it is often criticized for its low discriminative power in representing documents. In this paper, we propose clustering SVD, in which SVD is conducted on text clusters rather than on the whole term-document matrix, to improve the discriminative power of SVD-based latent semantic indexing. The key idea is to first cluster the texts in a collection and then carry out SVD on the resulting text clusters. We conjecture that the clustering step will improve the statistical quality of the index terms produced by latent semantic indexing. A Chinese corpus and an English corpus are each used to evaluate the method. The experiments show that the proposed method improves the precision of the inter-document similarity measure compared with classic SVD-based LSI. Moreover, its advantage over SVD-based LSI becomes more significant as the preservation rate for matrix approximation is set lower.
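The cluster-then-decompose idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the clustering algorithm (a small k-means here), the definition of the preservation rate (the fraction of the sum of singular values retained), cosine similarity as the inter-document measure, and all function names and the toy matrix.

```python
import numpy as np


def truncate_by_rate(matrix, rate):
    """Low-rank approximation of `matrix`.

    Assumption: the preservation rate is the fraction of the sum of
    singular values that the kept components must reach.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    total = s.sum()
    if total == 0:
        return matrix.copy()
    # Smallest k whose cumulative share of singular values reaches `rate`.
    k = int(np.searchsorted(np.cumsum(s) / total, rate)) + 1
    k = min(k, len(s))
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def kmeans_labels(docs, n_clusters, n_iter=20, seed=0):
    """Tiny k-means over document rows (assumed clustering step)."""
    rng = np.random.default_rng(seed)
    centers = docs[rng.choice(len(docs), n_clusters, replace=False)]
    for _ in range(n_iter):
        dists = ((docs[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):
                centers[c] = docs[labels == c].mean(axis=0)
    return labels


def clustering_svd(term_doc, labels, rate):
    """Approximate each cluster's term-document submatrix separately."""
    approx = np.zeros_like(term_doc, dtype=float)
    for c in np.unique(labels):
        cols = np.where(labels == c)[0]
        approx[:, cols] = truncate_by_rate(term_doc[:, cols], rate)
    return approx


def cosine_similarity(matrix):
    """Pairwise cosine similarity between document columns."""
    norms = np.linalg.norm(matrix, axis=0)
    norms[norms == 0] = 1.0  # avoid division by zero for empty documents
    unit = matrix / norms
    return unit.T @ unit


# Hypothetical toy term-document matrix: rows are terms, columns are documents.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 3],
    [1, 1, 1, 1],
], dtype=float)

labels = kmeans_labels(term_doc.T, n_clusters=2)
approx = clustering_svd(term_doc, labels, rate=0.6)
sims = cosine_similarity(approx)
print(np.round(sims, 3))
```

In this sketch, classic SVD-based LSI corresponds to calling `truncate_by_rate` once on the whole matrix, so the two approaches can be compared at the same preservation rate by computing `cosine_similarity` on each approximation.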
