Conference: International conference on computational linguistics

Native Tongues, Lost and Found: Resources and Empirical Evaluations in Native Language Identification


Abstract

In this paper we present work on the task of Native Language Identification (NLI). We present an alternative corpus to the ICLE which has been used in most work up until now. We believe that our corpus, TOEFL11, is more suitable for the task of NLI and will allow researchers to better compare systems and results. We show that many of the features that have been commonly used in this task generalize to new and larger corpora. In addition, we examine possible ways of increasing current system performance (e.g., additional features and feature combination methods), and achieve overall state-of-the-art results (accuracy of 90.1%) on the ICLE corpus using an ensemble classifier that includes previously examined features and a novel feature (n-gram language models). We also show that training on a large corpus and testing on a smaller one works well, but not vice versa. Finally, we show that system performance varies across proficiency scores.
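To make the n-gram language model feature concrete, here is a minimal sketch of one way such a feature could be computed. It is not the authors' implementation: the abstract does not say whether the models are word-based or character-based, how they are smoothed, or how their scores enter the ensemble. The sketch assumes character trigrams, add-one smoothing, and one model per native language whose average log-probability for an essay serves as a feature. The names `NgramLM`, `lm_features`, and `predict` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramLM:
    """Character n-gram language model with add-one smoothing (an assumption)."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = Counter()    # counts of full n-grams
        self.context_counts = Counter()  # counts of the (n-1)-character prefixes
        self.vocab = set()               # characters seen in training

    def fit(self, texts):
        for text in texts:
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1
                self.vocab.add(gram[-1])
        return self

    def avg_log_prob(self, text):
        """Average per-n-gram log-probability of text under this model."""
        grams = char_ngrams(text, self.n)
        if not grams:
            return 0.0
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for gram in grams:
            num = self.ngram_counts[gram] + 1
            den = self.context_counts[gram[:-1]] + v
            total += math.log(num / den)
        return total / len(grams)


def lm_features(models, text):
    """One feature per native language: the score of text under that language's model."""
    return {l1: lm.avg_log_prob(text) for l1, lm in models.items()}


def predict(models, text):
    """Baseline decision: pick the native language whose model scores the text highest."""
    feats = lm_features(models, text)
    return max(feats, key=feats.get)
```

In an ensemble of the kind the abstract describes, the dictionary returned by `lm_features` would be passed to a downstream classifier alongside the other feature types, rather than being used for a direct argmax as `predict` does here.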
