Privileged information for hierarchical document clustering: a metric learning approach

Abstract

Traditional hierarchical text clustering methods assume that documents are represented only by “technical information”, i.e., keywords, phrases, expressions and named entities that can be extracted directly from the texts. However, in many scenarios additional, valuable information about the documents is available but usually disregarded during the clustering task, such as user-validated tags, annotations and comments from experts, dictionaries and domain ontologies. Recently, Vapnik introduced a new learning paradigm, called Learning Using Privileged Information (LUPI), which allows this additional (privileged) information to be incorporated in a supervised learning setting. We investigate the incorporation of privileged information in an unsupervised setting. The key idea of our proposed approach is to extract important relationships among documents as represented in the privileged information space, and to use them to learn a more accurate metric for text clustering in the technical information space. A thorough experimental evaluation indicates that incorporating privileged information through metric learning significantly improves hierarchical clustering accuracy.
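The key idea can be illustrated with a minimal sketch. The abstract does not specify the metric learning algorithm, so everything below is an assumption made for illustration: the toy data, the function names (`pairs_from_privileged`, `learn_diagonal_metric`, `average_linkage`), the rule that marks a pair as similar when its privileged distance is below the mean, and the simple per-feature weighting heuristic that stands in for a real metric learner. It is not the authors' method, only one way the three steps (relationships from privileged space, metric in technical space, agglomerative clustering) might fit together.

```python
import math
from itertools import combinations


def dist(a, b, w=None):
    """Euclidean distance, optionally with per-feature weights w."""
    if w is None:
        w = [1.0] * len(a)
    return math.sqrt(sum(wi * (x - y) ** 2 for wi, x, y in zip(w, a, b)))


def pairs_from_privileged(priv):
    """Split document pairs into similar / dissimilar using privileged space.

    Assumption: a pair is 'similar' if its privileged distance is below the
    mean pairwise privileged distance.
    """
    pairs = list(combinations(range(len(priv)), 2))
    d = {p: dist(priv[p[0]], priv[p[1]]) for p in pairs}
    cut = sum(d.values()) / len(d)
    similar = [p for p in pairs if d[p] < cut]
    dissimilar = [p for p in pairs if d[p] >= cut]
    return similar, dissimilar


def learn_diagonal_metric(tech, similar, dissimilar, eps=1e-9):
    """Heuristic stand-in for metric learning: one weight per technical feature.

    A feature gets a large weight when it varies a lot across dissimilar
    pairs and little across similar pairs. Weights are normalised to sum to 1.
    """
    def mean_sq(pairs, j):
        return sum((tech[a][j] - tech[b][j]) ** 2 for a, b in pairs) / len(pairs)

    raw = [mean_sq(dissimilar, j) / (mean_sq(similar, j) + eps)
           for j in range(len(tech[0]))]
    total = sum(raw)
    return [r / total for r in raw]


def average_linkage(points, n_clusters, w=None):
    """Agglomerative average-linkage clustering, stopped at n_clusters."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = sum(dist(points[i], points[j], w)
                    for i in clusters[a] for j in clusters[b])
            d /= len(clusters[a]) * len(clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]                          # a < b, so index a is unaffected
    return sorted(sorted(c) for c in clusters)


# Hypothetical toy data: 6 documents, two true groups {0,1,2} and {3,4,5}.
# Technical space: feature 0 is informative, feature 1 is large-scale noise.
tech = [[0.0, 0.0], [0.2, 3.0], [0.1, 6.0],
        [1.0, 0.3], [1.2, 3.2], [1.1, 6.3]]
# Privileged space (e.g. a score derived from expert tags) separates the groups.
priv = [[0.0], [0.1], [0.05], [1.0], [0.9], [1.05]]

similar, dissimilar = pairs_from_privileged(priv)
weights = learn_diagonal_metric(tech, similar, dissimilar)

baseline = average_linkage(tech, 2)          # technical information only
learned = average_linkage(tech, 2, weights)  # metric informed by privileged space

print("weights:", weights)
print("baseline clusters:", baseline)
print("learned clusters:", learned)
```

On this toy data the learned weights favour the informative feature, so clustering with the learned metric separates the two groups, while the unweighted baseline is misled by the noisy feature. The privileged features are used only to learn the metric; the clustering itself runs on the technical features alone. For brevity the sketch stops at a flat cut instead of returning a full dendrogram.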