Journal: Information Systems

A two-stage machine learning approach for temporally-robust text classification

Abstract

One of the most relevant research topics in Information Retrieval is Automatic Document Classification (ADC). Several ADC algorithms have been proposed in the literature. However, the majority of these algorithms assume that the underlying data distribution does not change over time. Previous work has demonstrated the negative impact of three main temporal effects in representative textual datasets, reflected by variations observed over time in the class distribution, in the pairwise class similarities, and in the relationships between terms and classes [1]. To minimize the impact of temporal effects on ADC algorithms, we have previously introduced the notion of a temporal weighting function (TWF), which reflects the varying nature of textual datasets. We have also proposed a procedure to derive the TWF's expression and parameters. However, deriving the TWF requires running explicit and complex statistical tests, which are very cumbersome or cannot even be run in several cases. In this article, we propose a machine learning methodology to automatically learn the TWF without the need to perform any statistical tests. We also propose new strategies to incorporate the TWF into ADC algorithms, which we call temporally-aware classifiers. Experiments showed that the fully-automated temporally-aware classifiers achieved significant gains (up to 17%) when compared to their non-temporal counterparts, even outperforming some state-of-the-art algorithms (e.g., SVM) in most cases, with large reductions in execution time. (C) 2017 Elsevier Ltd. All rights reserved.
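
To make the idea of a temporally-aware classifier concrete, the following is a minimal sketch, not the authors' method. The abstract does not give the TWF's expression or how it is learned, so everything here is an assumption made for illustration: the exponential-decay `twf`, its `tau` parameter, the kNN-style voting in `classify`, and the toy `train` data. It only shows how weighting training documents by their temporal distance to the test document can change a prediction when the relationship between a term and a class drifts over time.

```python
import math
from collections import Counter, defaultdict


def twf(delta, tau=2.0):
    """Hypothetical TWF: exponential decay in the temporal distance (years).

    The paper learns its TWF automatically; this fixed form is an assumption.
    """
    return math.exp(-abs(delta) / tau)


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def classify(train, text, year, k=3, weight=twf):
    """kNN-style vote where each neighbor's similarity is scaled by weight(delta)."""
    query = Counter(text.split())
    scored = sorted(
        (
            (cosine(query, Counter(doc.split())) * weight(doc_year - year), label)
            for doc, label, doc_year in train
        ),
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for score, label in scored:
        votes[label] += score
    return max(votes, key=votes.get)


# Toy corpus (made up): the term "pluto" changes class over time.
train = [
    ("pluto orbit planet", "planets", 1995),
    ("pluto planet mission", "planets", 1998),
    ("pluto dwarf orbit", "dwarf_planets", 2010),
]

# Non-temporal baseline: every training document counts equally.
print(classify(train, "pluto orbit", 2011, weight=lambda d: 1.0))  # planets
# Temporally-aware variant: recent documents dominate the vote.
print(classify(train, "pluto orbit", 2011))  # dwarf_planets
```

In this sketch the weight multiplies each neighbor's score. That is only one possible way to incorporate a TWF, and the abstract does not say which strategies the article actually uses.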
