Workshop on Graph-based Methods for Natural Language Processing 2011

Simultaneous Similarity Learning and Feature-Weight Learning for Document Clustering

Abstract

A key problem in document classification and clustering is learning the similarity between documents. Traditional approaches estimate similarity between document feature vectors, where the vectors are computed using TF-IDF in the bag-of-words model. However, these approaches do not work well when similar documents do not use the same vocabulary or when the feature vectors are not estimated correctly. In this paper, we represent documents and keywords using multiple layers of connected graphs. We pose the problem of simultaneously learning similarity between documents and keyword weights as an edge-weight regularization problem over the different layers of graphs. Unlike most feature-weight learning algorithms, the algorithm we propose in this framework is unsupervised, and it simultaneously optimizes similarity and the keyword weights. We extrinsically evaluate the performance of the proposed similarity measure on two different tasks, clustering and classification. The proposed similarity measure outperforms the similarity measure proposed by Muthukrishnan et al. (2010), a state-of-the-art classification algorithm (Zhou and Burges, 2007), and three different baselines on a variety of standard, large data sets.
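To make the vocabulary-mismatch limitation of the traditional approach concrete, the following is a minimal sketch of TF-IDF cosine similarity over bag-of-words vectors. It illustrates only the baseline described in the abstract, not the paper's graph-based method. The function names (`tfidf_vectors`, `cosine`), the smoothed IDF formula, and the toy documents are assumptions introduced for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus (hypothetical): documents 0 and 1 discuss the same topic
# with disjoint vocabularies; documents 0 and 2 share several words.
docs = [
    "the car engine needs repair".split(),
    "automobile motor maintenance".split(),
    "the car engine is loud".split(),
]
vecs = tfidf_vectors(docs)
sim_disjoint = cosine(vecs[0], vecs[1])  # no shared terms, so the dot product is zero
sim_overlap = cosine(vecs[0], vecs[2])   # shared terms give a positive score
```

Because documents 0 and 1 share no terms, their cosine similarity is zero even though they are topically related, which is the failure case that motivates learning similarity rather than reading it directly off TF-IDF vectors.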
