Journal: ACM Transactions on Asian Language Information Processing

Pairwise Comparative Classification for Translator Stylometric Analysis

Abstract

In this article, we present a new type of classification problem, which we call the Comparative Classification Problem (CCP), where we use the term data record to refer to a block of instances. Given a single data record with n instances for n classes, the CCP is to map each instance to a unique class. This problem occurs in a wide range of applications where the independent and identically distributed (iid) assumption is violated. The primary difference between CCP and classical classification is that in the latter, the assignment of a translator to one record is independent of the assignment of a translator to a different record. In CCP, however, assigning a translator to one record within a block excludes that translator from being assigned to any other record in that block. This interdependency in the data poses challenges for techniques that rely on the iid assumption.

In the Pairwise CCP (PWCCP), a pair of records is grouped together. The key difference between PWCCP and classical binary classification problems is that hidden patterns can only be unmasked by comparing the instances as pairs. In this article, we introduce a new algorithm, PWC4.5, which is based on C4.5, to handle PWCCP. We first show that a simple transformation, which we call Gradient-Based Transformation (GBT), can fix the iid problem in C4.5. We then evaluate PWC4.5 on two real-world corpora to distinguish between translators of Arabic-English and French-English translations. While traditional C4.5 failed to distinguish between different translators, GBT demonstrated better performance. Meanwhile, PWC4.5 consistently provided the best results, outperforming both C4.5 and GBT.
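
The uniqueness constraint that separates CCP from classical classification can be illustrated with a small sketch. This is not the PWC4.5 algorithm or the GBT from the article, whose details the abstract does not give. It only contrasts labelling each instance independently with choosing a one-to-one assignment over a block. The score matrix and the function names `independent_labels` and `comparative_labels` are hypothetical, introduced here for illustration.

```python
from itertools import permutations


def independent_labels(scores):
    """Classical view: label each instance on its own (argmax per row).

    scores[i][j] is an assumed compatibility score between
    instance i and class (translator) j.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


def comparative_labels(scores):
    """CCP view: choose the one-to-one assignment with the highest total score.

    Brute force over all permutations, so only suitable for small blocks
    such as the pairwise (n = 2) case.
    """
    n = len(scores)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(scores[i][perm[i]] for i in range(n)),
    )
    return list(best)


# Hypothetical block of two instances scored against two translators.
scores = [
    [0.6, 0.4],  # instance 0
    [0.7, 0.3],  # instance 1
]

print(independent_labels(scores))  # [0, 0]: both instances get translator 0
print(comparative_labels(scores))  # [1, 0]: each translator is used exactly once
```

In the independent case both instances receive the same translator, which a CCP block does not allow. The joint assignment resolves the conflict by comparing the two instances against each other, which is the intuition behind treating the pair as a unit.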