Journal: IEEE Transactions on Knowledge and Data Engineering

On using partial supervision for text categorization


Abstract

We discuss the merits of building text categorization systems by using supervised clustering techniques. Traditional approaches for document classification on a predefined set of classes are often unable to provide sufficient accuracy because of the difficulty of fitting a manually categorized collection of documents in a given classification model. This is especially the case for heterogeneous collections of Web documents which have varying styles, vocabulary, and authorship. Hence, we investigate the use of clustering in order to create the set of categories and its use for classification of documents. Completely unsupervised clustering has the disadvantage that it has difficulty in isolating sufficiently fine-grained classes of documents relating to a coherent subject matter. We use the information from a preexisting taxonomy in order to supervise the creation of a set of related clusters, though with some freedom in defining and creating the classes. We show that the advantage of using partially supervised clustering is that it is possible to have some control over the range of subjects that one would like the categorization system to address, but with a precise mathematical definition of how each category is defined. An extremely effective way then to categorize documents is to use this a priori knowledge of the definition of each category. We also discuss a new technique to help the classifier distinguish better among closely related clusters.
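
To make the idea of partial supervision concrete, here is a minimal sketch, not the authors' algorithm. It assumes a simple seeded nearest-centroid scheme: documents already filed under a taxonomy category seed one cluster each, unlabeled documents are repeatedly assigned to the most similar centroid by cosine similarity, and a new document is then categorized by its closest centroid. The function names (`vectorize`, `seeded_clusters`, `classify`), the plain term-frequency weighting, and the toy data are all illustrative assumptions; the abstract does not specify these details.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into an L2-normalized bag-of-words vector (dict term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def centroid(vectors):
    """Normalized sum of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        for t, w in v.items():
            total[t] += w
    norm = math.sqrt(sum(w * w for w in total.values()))
    return {t: w / norm for t, w in total.items()} if norm else {}


def seeded_clusters(labeled_docs, unlabeled_docs, iterations=5):
    """Build one centroid per taxonomy label (hypothetical helper).

    labeled_docs: dict mapping label -> list of texts taken from the taxonomy.
    unlabeled_docs: list of texts with no label.
    Seed documents always stay in their own cluster (the supervision);
    unlabeled documents are free to move between clusters on each pass.
    """
    seeds = {lab: [vectorize(t) for t in texts] for lab, texts in labeled_docs.items()}
    pool = [vectorize(t) for t in unlabeled_docs]
    centroids = {lab: centroid(vs) for lab, vs in seeds.items()}
    for _ in range(iterations):
        members = {lab: list(vs) for lab, vs in seeds.items()}
        for v in pool:
            best = max(centroids, key=lambda lab: cosine(v, centroids[lab]))
            members[best].append(v)
        centroids = {lab: centroid(vs) for lab, vs in members.items()}
    return centroids


def classify(text, centroids):
    """Assign a document to the label of its most similar centroid."""
    v = vectorize(text)
    return max(centroids, key=lambda lab: cosine(v, centroids[lab]))


# Toy data, invented purely for illustration.
taxonomy = {
    "sports": ["football match goal team", "tennis match player serve"],
    "finance": ["stock market shares price", "bank interest loan rate"],
}
unlabeled = ["team wins the match with a late goal", "market price of shares falls"]
model = seeded_clusters(taxonomy, unlabeled)
label = classify("loan interest rate rises", model)
```

Because each category is just a centroid vector, its definition is explicit and can be inspected, which is the kind of precise category definition the abstract refers to. A realistic system would add term weighting, cluster merging, and handling of closely related clusters, none of which this sketch attempts.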
