ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

On the use of linear programming for unsupervised text classification

Abstract

We propose a new algorithm for dimensionality reduction and unsupervised text classification. We use mixture models as the underlying process that generates the corpus and apply a novel L1-norm-based approach introduced by Kleinberg and Sandler [19]. We show that our algorithm performs extremely well on large datasets, with peak accuracy approaching that of supervised learning based on Support Vector Machines (SVMs) with large training sets. The method is based on the same idea that underlies Latent Semantic Indexing (LSI): we find a good low-dimensional subspace of the feature space and project all documents into it. However, our projection minimizes a different error, and unlike LSI we build a basis that in many cases corresponds to the actual topics. We present the results of testing our algorithm on the abstracts of arXiv, an electronic repository of scientific papers, and on the 20 Newsgroups dataset, a small snapshot of 20 specific newsgroups.
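
The abstract does not spell out the linear program itself, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a topic basis is already known (a hypothetical matrix `basis` whose columns are word distributions) and shows how projecting one document onto that basis under the L1 norm can be posed as a linear program. The function name `l1_project`, the toy data, and the choice of SciPy's `linprog` solver are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linprog


def l1_project(doc, basis):
    """Find mixture weights w minimizing ||doc - basis @ w||_1
    subject to w >= 0 and sum(w) == 1.

    doc:   (n,) word distribution of one document (assumed input)
    basis: (n, k) matrix, each column a topic's word distribution (assumed input)
    """
    n, k = basis.shape
    # Variables are [w (k entries), t (n entries)]; t bounds the absolute residuals.
    cost = np.concatenate([np.zeros(k), np.ones(n)])
    eye = np.eye(n)
    # Encode |doc - basis @ w| <= t as two sets of linear inequalities.
    a_ub = np.block([[basis, -eye], [-basis, -eye]])
    b_ub = np.concatenate([doc, -doc])
    # The weights must sum to one (a mixture over topics).
    a_eq = np.concatenate([np.ones(k), np.zeros(n)])[None, :]
    b_eq = [1.0]
    result = linprog(cost, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq,
                     bounds=(0, None), method="highs")
    if not result.success:
        raise RuntimeError(result.message)
    return result.x[:k], float(result.fun)


# Toy example: two topics over a four-word vocabulary (made-up numbers).
basis = np.array([[0.5, 0.0],
                  [0.5, 0.0],
                  [0.0, 0.5],
                  [0.0, 0.5]])
doc = 0.7 * basis[:, 0] + 0.3 * basis[:, 1]
weights, residual = l1_project(doc, basis)
print(weights, residual)
```

In this toy case the document is an exact mixture of the two topics, so the recovered weights should match the mixing proportions and the L1 residual should be close to zero. Finding the basis itself, which is the harder part the paper addresses, is outside the scope of this sketch.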