Journal: Data Mining and Knowledge Discovery

Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks



Abstract

A heterogeneous information network (HIN) is a general representation used in many different applications, such as social networks, scholar networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity between two entities of the same type in a HIN. When using PathSim in practice, we usually need to handcraft some meta-paths, which are paths over entity types rather than over entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach that automatically finds useful meta-paths over a HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HINs for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can compute the meta-path-based similarity efficiently and effectively.
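To make the PathSim measure mentioned in the abstract concrete, the following is a minimal sketch in Python. It assumes the commonly cited PathSim definition for a symmetric meta-path, s(x, y) = 2·M[x][y] / (M[x][x] + M[y][y]), where M counts the path instances between entities. The toy document-entity counts, the Document-Entity-Document meta-path, and the function names are hypothetical illustrations; this is not the paper's KnowSim implementation, which combines several selected meta-paths.

```python
def commuting_matrix(half):
    """Path-count matrix M = W * W^T for a symmetric meta-path.

    `half` is W: the count matrix for the first half of the meta-path
    (here, hypothetical document-to-entity mention counts).
    """
    n = len(half)
    return [
        [sum(a * b for a, b in zip(half[i], half[j])) for j in range(n)]
        for i in range(n)
    ]


def pathsim(m, x, y):
    """PathSim between entities x and y given the path-count matrix m."""
    denom = m[x][x] + m[y][y]
    # Entities with no path instances at all are treated as dissimilar.
    return 2.0 * m[x][y] / denom if denom else 0.0


# Hypothetical toy data: 3 documents (rows) x 3 entities (columns).
doc_entity = [
    [2, 1, 0],
    [1, 1, 0],
    [0, 0, 3],
]

m = commuting_matrix(doc_entity)
for i in range(3):
    print([round(pathsim(m, i, j), 3) for j in range(3)])
```

In this toy example, documents 0 and 1 share entities and receive a high score, while document 2 shares none with the others and scores zero against them.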
