Parameterized generation of labeled datasets for text categorization based on a hierarchical directory

Abstract

Although text categorization is a burgeoning area of IR research, readily available test collections in this field are surprisingly scarce. We describe a methodology and system (named ACCIO) for automatically acquiring labeled datasets for text categorization from the World Wide Web, by capitalizing on the body of knowledge encoded in the structure of existing hierarchical directories such as the Open Directory. We define parameters of categories that make it possible to acquire numerous datasets with desired properties, which in turn allow better control over categorization experiments. In particular, we develop metrics that estimate the difficulty of a dataset by examining the host directory structure. These metrics are shown to be good predictors of the categorization accuracy that can be achieved on a dataset, and they serve as efficient heuristics for generating datasets subject to users' requirements. A large collection of automatically generated datasets is made available for other researchers to use.
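
To make the idea of a structure-based difficulty metric concrete, here is a minimal sketch. The abstract does not specify the metrics ACCIO uses, so this is an illustrative assumption rather than the authors' method: it measures the number of edges between two category nodes in a directory tree, on the intuition that categories sitting close together in the hierarchy are likely harder to tell apart. The function name `path_distance` and the slash-separated category paths are hypothetical.

```python
def path_distance(cat_a: str, cat_b: str, sep: str = "/") -> int:
    """Count the edges between two category nodes in a directory tree.

    Hypothetical illustration only: each category is given as a path
    from the root, e.g. "Top/Science/Physics". A smaller distance means
    the categories are closer in the hierarchy, which we assume (not
    confirmed by the source) makes them harder to separate.
    """
    a = cat_a.strip(sep).split(sep)
    b = cat_b.strip(sep).split(sep)

    # Length of the shared prefix, i.e. depth of the lowest common ancestor.
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1

    # Edges from each node up to the common ancestor.
    return (len(a) - common) + (len(b) - common)


if __name__ == "__main__":
    print(path_distance("Top/Science/Physics", "Top/Science/Chemistry"))
    print(path_distance("Top/Science/Physics", "Top/Arts/Music"))
```

Under this sketch, sibling categories such as Physics and Chemistry score a distance of 2, while categories in different top-level branches score higher, so a dataset generator could filter category pairs by a requested distance range.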