
Clustering Documents in a Web Directory


Abstract

Hierarchical categorization of documents is receiving growing interest due to the proliferation of topic hierarchies for text documents. The main drawback of hierarchical supervised classifiers is their high demand for labeled examples, the number of which is related to the number of topics in the taxonomy. Hence, bootstrapping a huge hierarchy with a proper set of labeled examples is a critical issue. In this paper, we propose some solutions to the bootstrapping problem that implicitly or explicitly use a taxonomy definition: a baseline approach in which documents are classified according to class labels, and two clustering approaches in which training is constrained by a priori knowledge of the taxonomy structure at both the terminological and topological levels. In particular, we propose the TaxSOM model, which clusters a set of documents into a predefined hierarchy of classes, directly exploiting knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google Web directory.
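To make the idea of classifying documents by class labels concrete, the following is a minimal sketch and not the authors' implementation. It assumes each taxonomy node is described only by the words in its path, and it assigns a document to the node whose label words have the highest cosine similarity with the document's term frequencies. The function names and the toy taxonomy are hypothetical, introduced here for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_by_labels(doc, taxonomy):
    """Return the taxonomy path whose label words best match the document.

    Returns None when no label word occurs in the document.
    """
    doc_vec = Counter(tokenize(doc))
    best_path, best_score = None, 0.0
    for path in taxonomy:
        # Assumption: a node's lexical description is just the words in its
        # path, so ancestor labels contribute to the match implicitly.
        score = cosine(doc_vec, Counter(tokenize(path)))
        if score > best_score:
            best_path, best_score = path, score
    return best_path


# Hypothetical toy taxonomy; the paths are illustrative, not from the paper.
TAXONOMY = [
    "Science/Physics",
    "Science/Biology",
    "Sports/Tennis",
]

if __name__ == "__main__":
    print(classify_by_labels("Quantum physics is a branch of science", TAXONOMY))
```

Because the path words of ancestors are part of each node's description, a document that mentions only a parent topic still scores against every child of that parent, which is one simple way the taxonomy can be used implicitly without any labeled examples.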
