Conference: International Conference on Speech and Computer

Unsupervised Document Classification and Topic Detection

Abstract

This article presents a method for pre-processing the feature vectors that represent text documents, which are subsequently classified using unsupervised methods. The main goal is to show that state-of-the-art classification methods can be improved by a particular data preparation process. Two methods are tested: standard K-means clustering and Latent Dirichlet Allocation (LDA). Both are widely used in text processing. The algorithms are applied to two data sets in two different languages. The first, 20NewsGroup, is a widely used benchmark for the classification of English documents. The second was selected from a large body of Czech news articles and serves mainly to compare the performance of the tested methods on a less frequently studied language. Furthermore, the unsupervised methods are compared with supervised ones in order to ascertain, in some sense, an upper bound for the task.
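
To make the pipeline in the abstract concrete, the following is a minimal sketch of the K-means branch only: documents are turned into feature vectors, the vectors are pre-processed, and the result is clustered without labels. The abstract does not say which pre-processing the authors apply, so the smoothed TF-IDF weighting and L2 normalisation used here are assumptions, as are the toy corpus, the function names, and the deterministic farthest-point initialisation (chosen only to keep the example reproducible). LDA is not covered.

```python
import math
from collections import Counter


def tfidf_matrix(token_lists):
    """Turn tokenised documents into dense, L2-normalised TF-IDF rows."""
    n_docs = len(token_lists)
    vocab = sorted({t for tokens in token_lists for t in tokens})
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF; the weighting scheme is an assumption, not the paper's.
        row = [counts[t] * math.log((1 + n_docs) / (1 + doc_freq[t]))
               for t in vocab]
        norm = math.sqrt(sum(w * w for w in row)) or 1.0
        rows.append([w / norm for w in row])
    return rows


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=50):
    """Standard K-means (Lloyd's algorithm); returns one label per point."""
    # Assumed initialisation: first point, then repeatedly the point
    # farthest from all centroids chosen so far (deterministic).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))

    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels


# Hypothetical toy corpus with two obvious topics.
docs = [
    "hockey team wins game",
    "team loses hockey game",
    "graphics card renders image",
    "image graphics driver card",
]
labels = kmeans(tfidf_matrix([d.split() for d in docs]), k=2)
print(labels)
```

On this toy corpus the two hockey documents and the two graphics documents end up in separate clusters. A real experiment on 20NewsGroup or the Czech news set would need proper tokenisation, a much larger vocabulary, and sparse vectors.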
