Journal: ACM Transactions on Asian and Low-Resource Language Information Processing

Native Language Identification of Fluent and Advanced Non-Native Writers


Abstract

Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) do not include any outlier-handling mechanism. Moreover, because a sizable number of people have acquired non-English second languages as a result of economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on this feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
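The pipeline outlined in the abstract (point-set representation, top-k SST retrieval, probabilistic nearest-neighbor vote) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, since the abstract does not give these details: the function-word features, the fixed-size chunking, the average-minimum set distance, and the inverse-distance vote shares are all hypothetical stand-ins, not the authors' actual feature space, distance, or classifier.

```python
import math
from collections import defaultdict

# Hypothetical topic-independent features: relative frequencies of a few
# function words plus the comma rate. The article's real features are not
# listed in the abstract.
FUNCTION_WORDS = ("the", "a", "of", "in", "to", "that")

def chunk_features(chunk):
    """Map one chunk of text to a 7-dimensional feature vector."""
    tokens = chunk.lower().replace(",", " , ").split()
    n = max(len(tokens), 1)
    return [tokens.count(w) / n for w in FUNCTION_WORDS + (",",)]

def to_point_set(text, chunk_size=20):
    """Split a text into fixed-size token chunks; each chunk is one point."""
    tokens = text.split()
    chunks = [" ".join(tokens[i:i + chunk_size])
              for i in range(0, len(tokens), chunk_size)]
    return [chunk_features(c) for c in chunks]

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def set_distance(A, B):
    """Symmetric average-minimum distance between point sets (assumed choice)."""
    if not A or not B:
        raise ValueError("point sets must be non-empty")
    def directed(X, Y):
        return sum(min(euclidean(x, y) for y in Y) for x in X) / len(X)
    return max(directed(A, B), directed(B, A))

def predict_native_language(query, corpus, k=3):
    """corpus: list of (point_set, label). Returns (label, probabilities)."""
    # Step 1: retrieve the top-k stylistically similar text samples.
    scored = sorted(((set_distance(query, pts), label) for pts, label in corpus),
                    key=lambda pair: pair[0])[:k]
    # Step 2: turn inverse-distance votes into class probabilities.
    weights = defaultdict(float)
    for dist, label in scored:
        weights[label] += 1.0 / (dist + 1e-9)
    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}
    return max(probs, key=probs.get), probs
```

Averaging the per-point minimum distances, rather than taking a single worst-case distance, is one plausible way to keep an unusual chunk from dominating the comparison between two texts; whether the authors handle outliers this way is not stated in the abstract.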
