Unsupervised Document Classification and Topic Detection

Abstract

This article presents a method for pre-processing the feature vectors that represent text documents, which are subsequently classified using unsupervised methods. The main goal is to show that state-of-the-art classification methods can be improved by a particular data preparation process. Two methods are tested: standard K-means clustering and Latent Dirichlet Allocation (LDA). Both are widely used in text processing. The algorithms are applied to two data sets in two different languages. The first, 20NewsGroup, is a widely used benchmark for the classification of English documents. The second was selected from a large body of Czech news articles and serves mainly to compare the performance of the tested methods on a less frequently studied language. Furthermore, the unsupervised methods are compared with supervised ones in order to ascertain (in some sense) an upper bound for the task.
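
As a rough illustration of the kind of pipeline the abstract describes, the sketch below clusters a toy corpus with K-means over document feature vectors. It is not the authors' method: the abstract does not say how the vectors are built or pre-processed, so the TF-IDF weighting, the L2 normalisation, the tiny stop-word list, the deterministic farthest-first initialisation, the toy sentences, and all function names are assumptions made for this example. LDA is not covered here.

```python
import math
from collections import Counter

# Assumption: a tiny illustrative stop-word list, not taken from the paper.
STOP_WORDS = {"the", "a", "in", "into"}


def tokenize(text):
    """Lower-case, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return L2-normalised TF-IDF vectors (sparse dicts) for tokenised docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: c * (math.log(n / df[t]) + 1.0) for t, c in tf.items()}
        norm = math.sqrt(sum(x * x for x in vec.values()))
        vectors.append({t: x / norm for t, x in vec.items()})
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for k, x in v.items():
            total[k] = total.get(k, 0.0) + x
    return {k: x / len(vectors) for k, x in total.items()}


def kmeans(vectors, k, max_iter=100):
    """Plain K-means; initial centroids chosen by farthest-first traversal
    (an assumption made here to keep the example deterministic)."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(far)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every document.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = mean(members)
    return labels


# Hypothetical toy corpus: two sports sentences and two space sentences.
corpus = [
    "the team won the hockey game",
    "the hockey team lost the game",
    "the rocket reached orbit in space",
    "the space agency launched a rocket into orbit",
]
vectors = tfidf_vectors([tokenize(d) for d in corpus])
labels = kmeans(vectors, k=2)
print(labels)
```

On a real benchmark such as 20NewsGroup, the same two stages (vector construction, then clustering) would apply, with the cluster assignments compared against the known newsgroup labels to evaluate the result.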

Bibliographic record

Source: International Conference on Speech and Computer, 2017, pp. 748-756 (9 pages)
Authors: Jaromir Novotny; Pavel Ircing
Original format: PDF
Keywords: Text pre-processing; Classification; Evaluation; LDA; K-means

