Journal: IEICE Transactions on Information and Systems

Generating Category Hierarchy for Classifying Large Corpora



Abstract

We address the problem of classifying large collections of data and investigate whether automatically constructed, domain-specific category hierarchies improve text classification. We create the category hierarchy with two well-known techniques: the partitioning clustering method k-means and a loss function. In each iteration, k-means passes over the data that the system is permitted to classify, and a hierarchical structure is constructed. In general, the number of clusters k is not given beforehand. We therefore use a loss function, which measures the degree of disappointment in any difference between the true distribution over inputs and the learner's prediction, to select the appropriate number of clusters k. Once the optimal k is selected, the procedure is repeated for each cluster. Our evaluation on the 1996 Reuters corpus, which consists of 806,791 documents, showed that automatically constructed hierarchies improve classification accuracy.
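The procedure outlined in the abstract (cluster with k-means, choose k by a loss function, then repeat inside each cluster) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not give the exact loss function, so the `loss` below substitutes within-cluster squared error plus a per-cluster penalty as a stand-in. The names `kmeans`, `loss`, `build_hierarchy`, and the parameters `max_k`, `penalty`, and `min_size` are all assumptions introduced for illustration, and documents are represented as small numeric tuples rather than real term vectors.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns a list of non-empty clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers drawn from the data
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        clusters = [c for c in clusters if c]  # drop clusters that became empty
        new_centers = [mean(c) for c in clusters]
        if new_centers == centers:  # assignments have stabilised
            break
        centers = new_centers
    return clusters


def loss(clusters, penalty):
    """Stand-in loss (an assumption, not the paper's): squared error plus a
    fixed penalty per cluster, so that adding clusters is not always better."""
    sse = sum(sq_dist(p, mean(c)) for c in clusters for p in c)
    return sse + penalty * len(clusters)


def build_hierarchy(points, max_k=4, penalty=1.0, min_size=4):
    """Pick the k with the lowest loss, then recurse into each cluster.
    Each node is a dict: {'points': [...], 'children': [...]}."""
    node = {"points": points, "children": []}
    if len(points) < min_size:
        return node
    candidates = [kmeans(points, k) for k in range(1, min(max_k, len(points)) + 1)]
    best = min(candidates, key=lambda cs: loss(cs, penalty))
    if len(best) > 1:  # k = 1 winning means this node stays a leaf
        node["children"] = [
            build_hierarchy(c, max_k, penalty, min_size) for c in best
        ]
    return node
```

Calling `build_hierarchy(points, penalty=2.0)` on a list of numeric tuples returns a nested dictionary whose `children` entries are the sub-clusters. The penalty term controls how readily a node is split; in this sketch it plays the role that the paper assigns to its loss-based choice of k.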
