
Using unlabeled data to improve text classification.


Abstract

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
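The core procedure the abstract describes, EM over a generative (multinomial naive Bayes) model that combines hard counts from labeled documents with soft, posterior-weighted counts from unlabeled documents, can be sketched as below. This is a minimal illustration, not the dissertation's implementation; the function names and toy documents are invented for the example, and MAP estimation is approximated with simple Laplace smoothing.

```python
import math
from collections import defaultdict

def train_em_nb(labeled, unlabeled, classes, n_iters=5, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM.

    labeled   : list of (word_list, class) pairs
    unlabeled : list of word_lists
    Returns (log_prior, log_cond), Laplace-smoothed parameter estimates.
    """
    vocab = sorted({w for doc, _ in labeled for w in doc} |
                   {w for doc in unlabeled for w in doc})

    def m_step(resp):
        # resp[i][c] = current probability that unlabeled doc i has class c
        prior = {c: alpha for c in classes}
        counts = {c: defaultdict(lambda: alpha) for c in classes}
        for doc, c in labeled:                      # hard (labeled) counts
            prior[c] += 1.0
            for w in doc:
                counts[c][w] += 1.0
        for doc, r in zip(unlabeled, resp):         # soft (unlabeled) counts
            for c in classes:
                prior[c] += r[c]
                for w in doc:
                    counts[c][w] += r[c]
        z = sum(prior.values())
        log_prior = {c: math.log(prior[c] / z) for c in classes}
        log_cond = {}
        for c in classes:
            denom = sum(counts[c][w] for w in vocab)
            log_cond[c] = {w: math.log(counts[c][w] / denom) for w in vocab}
        return log_prior, log_cond

    def e_step(doc, log_prior, log_cond):
        # posterior over classes for one document (out-of-vocab words ignored)
        score = {c: log_prior[c] + sum(log_cond[c].get(w, 0.0) for w in doc)
                 for c in classes}
        m = max(score.values())
        ex = {c: math.exp(score[c] - m) for c in classes}
        z = sum(ex.values())
        return {c: ex[c] / z for c in classes}

    # initialize parameters from the labeled data alone
    params = m_step([{c: 0.0 for c in classes} for _ in unlabeled])
    for _ in range(n_iters):
        resp = [e_step(doc, *params) for doc in unlabeled]   # E-step
        params = m_step(resp)                                # M-step
    return params

def classify(doc, log_prior, log_cond):
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c].get(w, 0.0) for w in doc))

# Toy illustration: two labeled documents plus four unlabeled ones.
labeled = [(["ball", "game", "score"], "sports"),
           (["election", "vote"], "politics")]
unlabeled = [["game", "team", "score"], ["vote", "senate", "election"],
             ["ball", "team"], ["senate", "policy"]]
log_prior, log_cond = train_em_nb(labeled, unlabeled, ["sports", "politics"])
print(classify(["team", "score"], log_prior, log_cond))   # sports
print(classify(["policy", "vote"], log_prior, log_cond))  # politics
```

Note how words such as "team" and "policy", which never appear in the labeled documents, still acquire class-conditional probability mass through the soft counts of the E-step; this is the mechanism by which unlabeled data can improve a classifier trained from few labels.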

Record details

  • Author

    Nigam, Kamal Paul

  • Affiliation

    Carnegie Mellon University

  • Degree grantor: Carnegie Mellon University
  • Subjects: Computer Science; Statistics
  • Degree: Ph.D.
  • Year: 2001
  • Pagination: 124 p.
  • Total pages: 124
  • Format: PDF
  • Language: English
  • Classification: Automation and computer technology; Statistics
  • Keywords

