
Using unlabeled data to improve text classification.


Abstract

One key difficulty with text classification learning algorithms is that they require many hand-labeled examples to learn accurately. This dissertation demonstrates that supervised learning algorithms that use a small number of labeled examples and many inexpensive unlabeled examples can create high-accuracy text classifiers. By assuming that documents are created by a parametric generative model, Expectation-Maximization (EM) finds local maximum a posteriori models and classifiers from all the data, labeled and unlabeled. These generative models do not capture all the intricacies of text; however, on some domains this technique substantially improves classification accuracy, especially when labeled data are sparse.

Two problems arise from this basic approach. First, unlabeled data can hurt performance in domains where the generative modeling assumptions are too strongly violated. In this case the assumptions can be made more representative in two ways: by modeling sub-topic class structure, and by modeling super-topic hierarchical class relationships. By doing so, model probability and classification accuracy come into correspondence, allowing unlabeled data to improve classification performance. The second problem is that even with a representative model, the improvements given by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here, limited labeled data provide EM initializations that lead to low-probability models. Performance can be significantly improved by using active learning to select high-quality initializations, and by using alternatives to EM that avoid low-probability local maxima.
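The core procedure the abstract describes, EM over a generative (multinomial naive Bayes) model that combines hard counts from labeled documents with soft, posterior-weighted counts from unlabeled documents, can be sketched as below. This is a minimal illustration, not the dissertation's implementation; the function names and toy documents are invented for the example, and MAP estimation is approximated with simple Laplace smoothing.

```python
import math
from collections import defaultdict

def train_em_nb(labeled, unlabeled, classes, n_iters=5, alpha=1.0):
    """Semi-supervised multinomial naive Bayes trained with EM.

    labeled   : list of (word_list, class) pairs
    unlabeled : list of word_lists
    Returns (log_prior, log_cond), Laplace-smoothed parameter estimates.
    """
    vocab = sorted({w for doc, _ in labeled for w in doc} |
                   {w for doc in unlabeled for w in doc})

    def m_step(resp):
        # resp[i][c] = current probability that unlabeled doc i has class c
        prior = {c: alpha for c in classes}
        counts = {c: defaultdict(lambda: alpha) for c in classes}
        for doc, c in labeled:                      # hard (labeled) counts
            prior[c] += 1.0
            for w in doc:
                counts[c][w] += 1.0
        for doc, r in zip(unlabeled, resp):         # soft (unlabeled) counts
            for c in classes:
                prior[c] += r[c]
                for w in doc:
                    counts[c][w] += r[c]
        z = sum(prior.values())
        log_prior = {c: math.log(prior[c] / z) for c in classes}
        log_cond = {}
        for c in classes:
            denom = sum(counts[c][w] for w in vocab)
            log_cond[c] = {w: math.log(counts[c][w] / denom) for w in vocab}
        return log_prior, log_cond

    def e_step(doc, log_prior, log_cond):
        # posterior over classes for one document (out-of-vocab words ignored)
        score = {c: log_prior[c] + sum(log_cond[c].get(w, 0.0) for w in doc)
                 for c in classes}
        m = max(score.values())
        ex = {c: math.exp(score[c] - m) for c in classes}
        z = sum(ex.values())
        return {c: ex[c] / z for c in classes}

    # initialize parameters from the labeled data alone
    params = m_step([{c: 0.0 for c in classes} for _ in unlabeled])
    for _ in range(n_iters):
        resp = [e_step(doc, *params) for doc in unlabeled]   # E-step
        params = m_step(resp)                                # M-step
    return params

def classify(doc, log_prior, log_cond):
    return max(log_prior, key=lambda c: log_prior[c] +
               sum(log_cond[c].get(w, 0.0) for w in doc))

# Toy illustration: two labeled documents plus four unlabeled ones.
labeled = [(["ball", "game", "score"], "sports"),
           (["election", "vote"], "politics")]
unlabeled = [["game", "team", "score"], ["vote", "senate", "election"],
             ["ball", "team"], ["senate", "policy"]]
log_prior, log_cond = train_em_nb(labeled, unlabeled, ["sports", "politics"])
print(classify(["team", "score"], log_prior, log_cond))   # sports
print(classify(["policy", "vote"], log_prior, log_cond))  # politics
```

Note how words such as "team" and "policy", which never appear in the labeled documents, still acquire class-conditional probability mass through the soft counts of the E-step; this is the mechanism by which unlabeled data can improve a classifier trained from few labels.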

Record details

  • Author

    Nigam, Kamal Paul

  • Affiliation

    Carnegie Mellon University

  • Degree grantor: Carnegie Mellon University
  • Subjects: Computer Science; Statistics
  • Degree: Ph.D.
  • Year: 2001
  • Pagination: 124 p.
  • Total pages: 124
  • Format: PDF
  • Language: English
  • Classification: Automation and computer technology; Statistics
  • Keywords

