Journal: Expert Systems with Applications

Efficient classification of multi-labeled text streams by clashing



Abstract

We present a method for the classification of multi-labeled text documents explicitly designed for data stream applications that must process a virtually infinite sequence of data using constant memory and constant processing time. Our method is composed of an online procedure used to efficiently map text into a low-dimensional feature space and a partition of this space into a set of regions for which the system extracts and keeps statistics used to predict multi-label text annotations. Documents are fed into the system as a sequence of words, mapped to a region of the partition, and annotated using the statistics computed from the labeled instances colliding in the same region. This approach is referred to as clashing. We illustrate the method on real-world text data, comparing the results with those obtained using other text classifiers. In addition, we provide an analysis of the effect of the representation space dimensionality on the predictive performance of the system. Our results show that the online embedding indeed approximates the geometry of the full corpus-wise TF and TF-IDF space. The model obtains competitive F measures with respect to the most accurate methods, using significantly fewer computational resources. In addition, the method achieves a higher macro-averaged F measure than methods with similar running time. Furthermore, the system is able to learn faster than the other methods from partially labeled streams.
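
The pipeline described in the abstract (embed a word sequence online, map it to a region, annotate from the statistics of labeled documents that collided in that region) can be illustrated with a minimal sketch. The abstract does not specify the embedding, the partition, or the prediction rule, so everything below is an assumption made for illustration: signed feature hashing stands in for the online embedding, the sign pattern of the embedded vector stands in for the partition, and a label is predicted when its relative frequency in the region reaches a hypothetical `threshold`. The names `ClashingSketch`, `dim`, and `threshold` are invented here and are not taken from the paper.

```python
import hashlib
from collections import defaultdict


def _stable_hash(token, salt):
    # Deterministic hash (Python's built-in hash() is salted per process).
    digest = hashlib.md5((salt + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ClashingSketch:
    """Illustrative only: hashing embedding + sign-pattern regions + label counts."""

    def __init__(self, dim=8, threshold=0.5):
        self.dim = dim                # assumed size of the low-dimensional space
        self.threshold = threshold    # assumed cut-off on label frequency
        self.region_docs = defaultdict(int)  # region -> labeled documents seen
        self.region_labels = defaultdict(lambda: defaultdict(int))  # region -> label -> count

    def embed(self, words):
        # One pass over the words; memory use is O(dim), independent of vocabulary size.
        vec = [0.0] * self.dim
        for w in words:
            idx = _stable_hash(w, "idx") % self.dim
            sign = 1.0 if _stable_hash(w, "sign") % 2 == 0 else -1.0
            vec[idx] += sign
        return vec

    def region(self, words):
        # Assumed partition: the sign pattern of the vector, so at most 2**dim regions.
        return tuple(v > 0 for v in self.embed(words))

    def learn(self, words, labels):
        # Update the statistics of the region in which this labeled document lands.
        r = self.region(words)
        self.region_docs[r] += 1
        for lab in set(labels):
            self.region_labels[r][lab] += 1

    def predict(self, words):
        # Annotate with labels that are frequent among documents colliding in the region.
        r = self.region(words)
        n = self.region_docs.get(r, 0)
        if n == 0:
            return set()
        return {lab for lab, c in self.region_labels[r].items()
                if c / n >= self.threshold}


model = ClashingSketch(dim=8, threshold=0.5)
doc = ["stock", "market", "shares"]
model.learn(doc, {"finance", "news"})
print(sorted(model.predict(doc)))  # ['finance', 'news']
```

In this sketch the number of regions is bounded by 2**dim, so for a fixed label set the stored statistics do not grow with the length of the stream, which is the constant-memory property the abstract emphasizes. How the authors actually obtain it is not stated in the abstract.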
