Journal: ACM Transactions on Information Systems

Funnelling: A New Ensemble Method for Heterogeneous Transfer Learning and Its Application to Cross-Lingual Text Classification

Abstract

Cross-lingual Text Classification (CLC) consists of automatically classifying, according to a common set C of classes, documents each written in one of a set of languages L, and doing so more accurately than when "naively" classifying each document via its corresponding language-specific classifier. To obtain an increase in the classification accuracy for a given language, the system thus needs to also leverage the training examples written in the other languages. We tackle "multilabel" CLC via funnelling, a new ensemble learning method that we propose here. Funnelling consists of generating a two-tier classification system where all documents, irrespective of language, are classified by the same (second-tier) classifier. For this classifier, all documents are represented in a common, language-independent feature space consisting of the posterior probabilities generated by first-tier, language-dependent classifiers. This allows the classification of all test documents, of any language, to benefit from the information present in all training documents, of any language. We present substantial experiments, run on publicly available multilingual text collections, in which funnelling is shown to significantly outperform a number of state-of-the-art baselines. All code and datasets (in vector form) are made publicly available.
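
The two-tier idea described above can be illustrated with a minimal sketch. It is not the authors' released code. The tiny logistic-regression learner, the toy data, and all names (`BinaryLogReg`, `MultiLabel`, `train_funnel`, `classify`) are hypothetical and chosen only for illustration. As a simplifying assumption, the second-tier classifier is trained on the posteriors that each first-tier classifier produces for its own training documents; the abstract does not say how those training posteriors are obtained.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class BinaryLogReg:
    """Tiny logistic regression trained by batch gradient descent (illustrative only)."""

    def __init__(self, n_features, lr=0.5, epochs=300):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr
        self.epochs = epochs

    def prob(self, x):
        return sigmoid(sum(wi * xi for wi, xi in zip(self.w, x)) + self.b)

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.epochs):
            grad_w = [0.0] * len(self.w)
            grad_b = 0.0
            for x, target in zip(X, y):
                err = self.prob(x) - target
                for j, xj in enumerate(x):
                    grad_w[j] += err * xj
                grad_b += err
            self.w = [wi - self.lr * g / n for wi, g in zip(self.w, grad_w)]
            self.b -= self.lr * grad_b / n
        return self


class MultiLabel:
    """Multilabel classifier: one independent binary model per class in C."""

    def __init__(self, n_features, n_classes):
        self.models = [BinaryLogReg(n_features) for _ in range(n_classes)]

    def fit(self, X, Y):
        for c, model in enumerate(self.models):
            model.fit(X, [row[c] for row in Y])
        return self

    def posteriors(self, x):
        # Vector of |C| posterior probabilities: the language-independent representation.
        return [model.prob(x) for model in self.models]


def train_funnel(train_by_lang, n_classes):
    """train_by_lang maps a language to (X, Y); each language has its own feature space."""
    first_tier, Z, Y_all = {}, [], []
    for lang, (X, Y) in train_by_lang.items():
        clf = MultiLabel(len(X[0]), n_classes).fit(X, Y)
        first_tier[lang] = clf
        # Assumption: reuse the training documents to build the second-tier training set.
        Z.extend(clf.posteriors(x) for x in X)
        Y_all.extend(Y)
    # A single second-tier classifier sees the documents of every language.
    meta = MultiLabel(n_classes, n_classes).fit(Z, Y_all)
    return first_tier, meta


def classify(first_tier, meta, lang, x, threshold=0.5):
    z = first_tier[lang].posteriors(x)  # first tier: language-dependent
    return [int(p >= threshold) for p in meta.posteriors(z)]  # second tier: shared


# Toy data: two classes; "en" documents have 3 features, "it" documents have 2.
train = {
    "en": ([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
    "it": ([[1, 0], [0, 1], [1, 1], [0, 0]],
           [[1, 0], [0, 1], [1, 1], [0, 0]]),
}
first_tier, meta = train_funnel(train, n_classes=2)
print(classify(first_tier, meta, "en", [1, 0, 0]))
print(classify(first_tier, meta, "it", [0, 1]))
```

The point of the sketch is that `meta` never sees language-specific features. It only sees vectors of length |C|, so training documents from every language contribute to the classifier that labels a test document of any one language.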