
A Parallel Learning Algorithm for Text Classification



Abstract

Text classification is the process of assigning documents to predefined categories based on their content. Existing supervised learning algorithms for automatic text classification need a sufficient number of labeled documents to learn accurately. Applying the Expectation-Maximization (EM) algorithm to this problem is an alternative approach that uses a large pool of unlabeled documents to augment the available labeled documents. Unfortunately, the time needed to learn from such a large pool of unlabeled documents is too high. This paper introduces a novel parallel learning algorithm for the text classification task. The parallel algorithm is based on a combination of the EM algorithm and the naive Bayes classifier. Our goal is to reduce the computation time of the learning and classification processes. We studied the performance of our parallel algorithm on a large Linux PC cluster called PIRUN Cluster. We report both timing and accuracy results. These results indicate that the proposed parallel algorithm is capable of handling large document collections.
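
The abstract does not describe how the work is divided across cluster nodes, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a multinomial naive Bayes model with Laplace smoothing and a data-parallel E-step in which the unlabeled documents are split into shards and scored independently. All names (`m_step`, `posterior`, `e_step_shard`, `em_naive_bayes`) are hypothetical, and a thread pool stands in for the cluster workers purely to show the partition-and-combine structure.

```python
import math
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def m_step(weighted_docs, classes, vocab):
    """Fit naive Bayes from (tokens, {class: weight}) pairs, with Laplace smoothing."""
    class_mass = {c: 1.0 for c in classes}        # add-one smoothing on the priors
    word_mass = {c: Counter() for c in classes}   # expected word counts per class
    for tokens, weights in weighted_docs:
        for c, w in weights.items():
            class_mass[c] += w
            for t in tokens:
                word_mass[c][t] += w
    total = sum(class_mass.values())
    log_prior = {c: math.log(class_mass[c] / total) for c in classes}
    log_like = {}
    for c in classes:
        denom = sum(word_mass[c].values()) + len(vocab)
        log_like[c] = {t: math.log((word_mass[c][t] + 1.0) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model, classes):
    """Return P(class | tokens) under the current model (tokens outside vocab are ignored)."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in classes
    }
    top = max(scores.values())                    # subtract the max for numerical stability
    unnorm = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}


def e_step_shard(shard, model, classes):
    """E-step for one shard: attach soft class labels to each unlabeled document."""
    return [(tokens, posterior(tokens, model, classes)) for tokens in shard]


def em_naive_bayes(labeled, unlabeled, n_workers=2, n_iters=5):
    """EM over labeled + unlabeled documents; the E-step runs shard by shard (assumed scheme)."""
    classes = sorted({y for _, y in labeled})
    vocab = {t for toks, _ in labeled for t in toks} | {t for toks in unlabeled for t in toks}
    hard = [(toks, {y: 1.0}) for toks, y in labeled]
    model = m_step(hard, classes, vocab)          # initialise from labeled documents only
    shards = [unlabeled[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        for _ in range(n_iters):
            parts = pool.map(e_step_shard, shards,
                             [model] * n_workers, [classes] * n_workers)
            soft = [pair for part in parts for pair in part]
            model = m_step(hard + soft, classes, vocab)   # combine and re-estimate
    return model, classes
```

In this sketch, a word that appears only in unlabeled documents can still acquire a class association: it co-occurs with words the labeled data has already tied to a class, and each EM iteration propagates that evidence. CPython threads will not speed up this pure-Python arithmetic, so a real deployment would replace the thread pool with processes or cluster nodes.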
