Conference: Sixth International Conference on Semantics Knowledge and Grid

Characteristics and Uses of Labeled Datasets - ODP Case Study

Abstract

Labeled datasets are essential for text categorization. They are used to train a classifier or serve as a benchmark collection for evaluating categorization algorithms. However, labeling a large-scale document set is extremely expensive because it involves much human labour, and the labeling process itself is subjective rather than objective. Labels assigned by only one human editor, as in some existing labeled document sets, may therefore be of limited use and may prove problematic for training a classifier or evaluating categorization algorithms. This research explores a socially constructed Web directory, the Open Directory Project (ODP), to generate a series of labeled document sets by extracting semantic characteristics from the ODP categories, which are annotated by lists of indexed Websites. The generated document sets are used to classify Web search results, and the results are encouraging.
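The abstract does not say how the semantic characteristics are extracted or how search results are classified, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes each category can be represented by a term-frequency profile built from the descriptions of the websites listed under it, and that a search-result snippet is assigned to the category whose profile is most similar by cosine similarity. The toy directory, the function names, and the choice of cosine similarity are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Hypothetical toy data (not taken from ODP): category -> descriptions of
# the websites indexed under that category.
TOY_DIRECTORY = {
    "Computers/Programming": [
        "Tutorials and reference for the Python programming language",
        "Open source compiler and software development tools",
    ],
    "Recreation/Travel": [
        "Travel guides, hotel reviews and flight booking tips",
        "Backpacking routes and holiday destinations in Europe",
    ],
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_profiles(directory):
    """Merge the site descriptions of each category into one term-frequency vector."""
    return {
        category: Counter(tok for desc in descriptions for tok in tokenize(desc))
        for category, descriptions in directory.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def classify(snippet, profiles):
    """Return the category whose profile is most similar to the snippet."""
    vector = Counter(tokenize(snippet))
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


profiles = build_profiles(TOY_DIRECTORY)
print(classify("Learn Python programming with free tutorials", profiles))
```

Because every category profile pools text from many independently listed websites, no single editor's judgment determines the label, which is the property the abstract highlights. A real system would likely need stop-word removal and term weighting, which this sketch omits.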
