Journal: Decision Support Systems

Exploiting poly-lingual documents for improving text categorization effectiveness


Abstract

With the globalization of business environments and rapid emergence and proliferation of the Internet, organizations or individuals often generate, acquire, and then archive documents written in different languages (i.e., poly-lingual documents). Prevalent document management practice is to use categories to organize this ever-increasing volume of poly-lingual documents for subsequent searches and accesses. Poly-lingual text categorization (PLTC) refers to the automatic learning of text categorization models from a set of preclassified training documents written in different languages and the subsequent assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced text categorization models. Although PLTC can be approached as multiple, independent monolingual text categorization problems, this naive PLTC approach employs only the training documents of the same language to construct a monolingual classifier and thus fails to exploit the opportunity offered by poly-lingual training documents. In this study, we propose a feature-reinforcement-based PLTC (FR-PLTC) technique that takes into account the training documents of all languages when constructing a monolingual classifier for a specific language. Using the independent monolingual text categorization (MnTC) approach as a performance benchmark, the empirical evaluation results show that our proposed FR-PLTC technique achieves higher classification accuracy than the benchmark technique. In addition, our empirical results suggest the superiority of the proposed FR-PLTC technique over its counterpart across a range of training sizes.
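The independent monolingual baseline (MnTC) described above can be sketched as follows: one classifier per language, each trained only on documents in that language. This is a minimal illustration, not the authors' implementation. The abstract does not name a learning algorithm, so a small naive Bayes classifier stands in here. The names `NaiveBayes`, `train_mntc`, and `classify`, the whitespace tokenization, and the English/Spanish toy data are all assumptions made for the example. The feature-reinforcement step of FR-PLTC is not shown because the abstract does not describe its mechanism.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative stand-in)."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # category -> token counts
        self.doc_counts = Counter(labels)        # category -> number of documents
        for text, label in zip(docs, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in text.lower().split():
                if tok in self.vocab:  # ignore tokens never seen in training
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_mntc(training_set):
    """training_set: iterable of (language, text, category) triples.

    Returns one classifier per language, each trained only on that
    language's documents (the naive approach the abstract describes).
    """
    by_lang = defaultdict(lambda: ([], []))
    for lang, text, category in training_set:
        by_lang[lang][0].append(text)
        by_lang[lang][1].append(category)
    return {lang: NaiveBayes().fit(docs, cats)
            for lang, (docs, cats) in by_lang.items()}


def classify(models, lang, text):
    """Route an unclassified document to the classifier for its language."""
    return models[lang].predict(text)


# Hypothetical toy data: whitespace-tokenized English and Spanish documents.
toy = [
    ("en", "stock market shares rise", "finance"),
    ("en", "team wins football match", "sports"),
    ("es", "mercado acciones bolsa sube", "finance"),
    ("es", "equipo gana partido futbol", "sports"),
]
models = train_mntc(toy)
```

Calling `classify(models, "en", "football team")` routes the document to the English model only. The Spanish training documents play no part in that decision, which is the missed opportunity that FR-PLTC is proposed to address.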
