
Corpus Based Unsupervised Labeling of Documents


Abstract

Text categorization involves mapping documents to a fixed set of labels. A related and equally important problem is that of assigning labels to large corpora. With a deluge of documents from sources like the World Wide Web, manual labeling by domain experts is prohibitively expensive. Reducing the effort required to label documents has received considerable attention in the past, but most of this work involved some kind of supervised or semi-supervised learning, which still depends on labeled examples. This motivates the need for automatic methods of annotating documents with labels. In this work we explore a novel method of assigning labels to documents without using any training data. The proposed method uses clustering to build semantically related sets that are used as candidate labels for documents. This technique could be used for labeling large corpora in an unattended fashion.
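The abstract does not specify the clustering algorithm, the features, or how a candidate label is matched to a document, so the following is only an illustrative sketch of the general idea and not the authors' method. It assumes that terms are grouped by single-link merging on the Jaccard similarity of the document sets they occur in, and that a document receives the term cluster it overlaps most. The names `tokenize`, `cluster_terms`, `label_document`, the stopword list, and the `threshold` value are all hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for", "on", "with", "as"}


def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def cluster_terms(docs, threshold=0.5):
    """Group terms whose document sets have Jaccard similarity >= threshold.

    Uses union-find for single-link merging; no training labels are needed.
    Returns only clusters with more than one term.
    """
    postings = defaultdict(set)  # term -> indices of documents containing it
    for i, doc in enumerate(docs):
        for term in tokenize(doc):
            postings[term].add(i)

    terms = sorted(postings)
    parent = {t: t for t in terms}

    def find(t):
        # Follow parent links to the cluster root, compressing the path.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for i, a in enumerate(terms):
        for b in terms[i + 1:]:
            inter = len(postings[a] & postings[b])
            union = len(postings[a] | postings[b])
            if inter / union >= threshold:
                parent[find(a)] = find(b)

    groups = defaultdict(set)
    for t in terms:
        groups[find(t)].add(t)
    return [frozenset(g) for g in groups.values() if len(g) > 1]


def label_document(doc, clusters):
    """Return the cluster sharing the most terms with doc, or None if none overlap."""
    tokens = set(tokenize(doc))
    best = max(clusters, key=lambda c: len(c & tokens), default=None)
    if best is None or not (best & tokens):
        return None
    return best


# Toy corpus (made up for illustration).
docs = [
    "stock market shares trading",
    "market shares stock prices",
    "football match goal team",
    "team goal football league",
]
clusters = cluster_terms(docs)
label = label_document("shares fell as trading slowed on the stock exchange", clusters)
```

On this toy corpus the finance-related and sports-related terms end up in separate clusters, and the new document is assigned the finance cluster. A realistic system would likely need richer term representations and a way to turn a term set into a readable label, neither of which the abstract describes.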
