Conference: International Conference on Advances in Computing, Communications and Informatics

Detection of a new class in a huge corpus of text documents through semi-supervised learning


Abstract

This paper poses a new problem: detecting an unknown class present in a text corpus that has a huge number of unlabeled samples but only a very small quantity of labeled samples. A simple yet efficient solution is also proposed, obtained by modifying a conventional clustering technique, to demonstrate the scope of the problem for further research. A novel way to estimate cluster diameter is proposed, which in turn is used to estimate the degree of dissimilarity between two clusters. The main idea of the model is to arrive at a cluster of unlabeled text samples that is far away from every labeled cluster, guided by a few rules such as the diameter of the cluster and the dissimilarity between pairs of clusters. This work is the first of its kind in the literature and has wide applications in text mining tasks. In fact, the proposed model is a general framework that can be applied to any application that involves identifying unseen classes in a semi-supervised learning environment. The model has been studied through extensive empirical analysis on different text datasets created from the benchmark 20Newsgroups dataset. The experimental results reveal the capabilities of the proposed approach and the possibilities for future research.
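The abstract does not give the paper's formulas, so the following is only an illustrative sketch of the general idea: flag a cluster of unlabeled samples as a new-class candidate when it is dissimilar from every labeled cluster. The diameter estimate (twice the largest centroid-to-member distance), the dissimilarity measure (centroid gap divided by the sum of the two estimated radii), the threshold of 1.0, and all function names are assumptions made for this example, not the authors' definitions. Documents are assumed to be already vectorized and already grouped into clusters.

```python
import math

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def estimated_diameter(points):
    # Assumed estimate: twice the largest distance from the centroid to a member.
    c = centroid(points)
    return 2.0 * max(euclidean(c, p) for p in points)

def dissimilarity(cluster_a, cluster_b):
    # Assumed measure: centroid gap relative to the sum of the two radii.
    # Values above 1 mean the two bounding spheres do not overlap.
    gap = euclidean(centroid(cluster_a), centroid(cluster_b))
    radii = (estimated_diameter(cluster_a) + estimated_diameter(cluster_b)) / 2.0
    if radii == 0:
        return math.inf if gap > 0 else 0.0
    return gap / radii

def find_new_class_candidates(labeled_clusters, unlabeled_clusters, threshold=1.0):
    """Return indices of unlabeled clusters that are far from every labeled cluster."""
    candidates = []
    for idx, cluster in enumerate(unlabeled_clusters):
        if all(dissimilarity(cluster, known) > threshold
               for known in labeled_clusters.values()):
            candidates.append(idx)
    return candidates

# Toy 2-D vectors standing in for document features (hypothetical data).
labeled = {
    "class_a": [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],
    "class_b": [[10.0, 10.0], [11.0, 10.0], [10.0, 11.0]],
}
unlabeled = [
    [[0.5, 0.5], [0.2, 0.8]],                # sits inside class_a
    [[10.5, 10.2], [10.1, 10.9]],            # sits inside class_b
    [[0.0, 20.0], [1.0, 20.0], [0.0, 21.0]], # far from both labeled clusters
]
candidates = find_new_class_candidates(labeled, unlabeled)
print(candidates)
```

In this toy setup only the third unlabeled cluster is far from both labeled clusters, so it is the only one reported as a candidate for an unseen class. A different clustering step, distance metric, or threshold would change which clusters qualify.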
