Journal: Information Processing & Management

Using the revised EM algorithm to remove noisy data for improving the one-against-the-rest method in binary text classification



Abstract

Automatic text classification is the task of assigning predefined categories to free-text documents, which reduces the manual labor required by traditional classification methods. When binary classifiers are applied to multi-class text classification, the one-against-the-rest method is usually used. In this method, a document that belongs to a particular category is regarded as a positive example of that category; otherwise, it is regarded as a negative example. Each category thus ends up with a positive data set and a negative data set. The one-against-the-rest method has a problem, however: the documents in a positive data set are labeled by humans, while those in a negative data set are not labeled manually. The negative data set therefore probably includes a lot of noisy data. In this paper, we propose applying the sliding window technique and the revised EM (Expectation Maximization) algorithm to binary text classification to solve this problem. We improve binary text classification by extracting potentially noisy documents from the negative data set using the sliding window technique and then removing the actually noisy documents using the revised EM algorithm. In our experiments, our method achieved better performance than the original one-against-the-rest method on all the data sets and with all the classifiers used.
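The one-against-the-rest construction described in the abstract can be illustrated with a minimal Python sketch. It covers only the way each category's positive and negative sets are formed; it does not implement the sliding window technique or the revised EM algorithm, whose details the abstract does not give. The function name `one_against_the_rest`, the document ids, and the category names are hypothetical and chosen for illustration.

```python
def one_against_the_rest(labeled_docs):
    """Build a positive and a negative document set for every category.

    labeled_docs: list of (doc_id, set_of_categories) pairs (toy format,
    assumed here for illustration).
    Returns {category: (positive_ids, negative_ids)}.
    """
    # Collect every category that appears in the labeled data.
    categories = set()
    for _, cats in labeled_docs:
        categories.update(cats)

    splits = {}
    for category in sorted(categories):
        # Positive examples: documents a human labeled with this category.
        positive = [doc_id for doc_id, cats in labeled_docs if category in cats]
        # Negative examples: everything else. Nobody checked these against
        # this category, so this set is where noisy documents can hide.
        negative = [doc_id for doc_id, cats in labeled_docs if category not in cats]
        splits[category] = (positive, negative)
    return splits


# Hypothetical toy corpus.
docs = [
    ("d1", {"sports"}),
    ("d2", {"politics"}),
    ("d3", {"sports", "politics"}),
    ("d4", {"economy"}),
]
splits = one_against_the_rest(docs)
```

In this toy corpus, `d4` lands in the negative set for `sports` only because it was never labeled `sports`, not because anyone judged it to be unrelated. That default assignment is the source of noise the abstract refers to.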
