
Compressive Feature Learning



Abstract

This paper addresses the problem of unsupervised feature learning for text data. Our method is grounded in the principle of minimum description length and uses a dictionary-based compression scheme to extract a succinct feature set. Specifically, our method finds a set of word k-grams that minimizes the cost of reconstructing the text losslessly. We formulate document compression as a binary optimization task and show how to solve it approximately via a sequence of reweighted linear programs that are efficient to solve and parallelizable. As our method is unsupervised, features may be extracted once and subsequently used in a variety of tasks. We demonstrate the performance of these features over a range of scenarios including unsupervised exploratory analysis and supervised text categorization. Our compressed feature space is two orders of magnitude smaller than the full k-gram space and matches the text categorization accuracy achieved in the full feature space. This dimensionality reduction not only results in faster training times, but it can also help elucidate structure in unsupervised learning tasks and reduce the amount of training data necessary for supervised learning.
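The trade-off the abstract describes, paying for dictionary entries versus paying for pointers that reconstruct the text, can be illustrated with a toy sketch. The paper solves a binary optimization via a sequence of reweighted linear programs; the greedy search below is only an illustrative stand-in for that relaxation, and the unit `pointer_cost` and the greedy longest-match decoder are simplifying assumptions, not details taken from the paper.

```python
def kgrams(words, max_k):
    """All contiguous word k-grams (1 <= k <= max_k) in the document."""
    return {tuple(words[i:i + k])
            for k in range(1, max_k + 1)
            for i in range(len(words) - k + 1)}

def reconstruction_cost(words, dictionary, pointer_cost=1.0):
    """Description length: dictionary storage plus one pointer per
    greedy longest-match step needed to rebuild `words` losslessly.
    Assumes every unigram of `words` is in `dictionary`."""
    cost = sum(len(g) for g in dictionary)
    i = 0
    while i < len(words):
        # take the longest dictionary entry matching at position i
        k = max(k for k in range(1, len(words) - i + 1)
                if tuple(words[i:i + k]) in dictionary)
        cost += pointer_cost
        i += k
    return cost

def compressive_features(words, max_k, pointer_cost=1.0):
    """Greedily add k-grams that lower the total description length.
    A heuristic stand-in for the paper's reweighted-LP relaxation of
    the binary dictionary-selection program."""
    dictionary = {(w,) for w in words}  # unigrams keep decoding lossless
    candidates = kgrams(words, max_k) - dictionary
    best = reconstruction_cost(words, dictionary, pointer_cost)
    improved = True
    while improved:
        improved = False
        for g in sorted(candidates, key=len, reverse=True):
            c = reconstruction_cost(words, dictionary | {g}, pointer_cost)
            if c < best:
                dictionary.add(g)
                best = c
                improved = True
        candidates -= dictionary
    return dictionary, best
```

On a document with a repeated phrase, e.g. `"the quick brown fox the quick brown fox".split()`, the unigram-only dictionary costs 4 entries plus 8 pointers, while adding the 4-gram `("the", "quick", "brown", "fox")` cuts the pointer count to 2, so the greedy search keeps it: the selected k-grams are exactly the compressed feature set the abstract refers to.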
