Native Language Identification of Fluent and Advanced Non-Native Writers

Sarwar Raheem; Rutherford Attapol T.; Hassan Saeed-Ul; Rakthanmanon Thanawin; Nutanong Sarana

Abstract

Native Language Identification (NLI) aims to identify the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific or social-network features and may not generalize to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) do not include any outlier-handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply a probabilistic k-nearest-neighbors classifier to the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and achieves more than 80% accuracy across languages.
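
The pipeline described in the abstract (point-set representation, top-k retrieval, probabilistic k-nearest-neighbors voting) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not specify the point-set distance, the feature space, or the voting scheme, so everything below is an assumption made for illustration: the distance is a symmetric average-nearest-point measure, the votes are weighted by 1 / (1 + distance), the two-dimensional feature vectors are toy values, and the labels "L1_A" and "L1_B" are hypothetical placeholders for native languages.

```python
import math
from collections import defaultdict


def euclidean(p, q):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def point_set_distance(set_a, set_b):
    """Distance between two point sets (ASSUMED measure, not from the paper).

    For each direction, average the distance from every point to its nearest
    point in the other set, then take the larger of the two directions.
    """
    def directed(xs, ys):
        return sum(min(euclidean(x, y) for y in ys) for x in xs) / len(xs)

    return max(directed(set_a, set_b), directed(set_b, set_a))


def predict_native_language(query, corpus, k=3):
    """Retrieve the top-k most similar samples and turn votes into probabilities.

    corpus is a list of (point_set, label) pairs. The 1 / (1 + d) vote weight
    is an illustrative choice.
    """
    scored = [(point_set_distance(query, points), label) for points, label in corpus]
    top_k = sorted(scored, key=lambda pair: pair[0])[:k]

    weights = defaultdict(float)
    for distance, label in top_k:
        weights[label] += 1.0 / (1.0 + distance)

    total = sum(weights.values())
    probs = {label: w / total for label, w in weights.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy corpus: each text sample is a set of feature vectors, e.g. one per text chunk.
corpus = [
    ([(0.10, 0.20), (0.20, 0.10)], "L1_A"),
    ([(0.15, 0.25), (0.10, 0.10)], "L1_A"),
    ([(0.90, 0.80), (0.80, 0.90)], "L1_B"),
    ([(0.85, 0.95), (0.95, 0.85)], "L1_B"),
]
query = [(0.12, 0.18), (0.20, 0.20)]

best, probs = predict_native_language(query, corpus, k=3)
print(best, probs)
```

Representing a sample as a set of points rather than a single averaged vector is what lets a set-to-set distance reflect variation within the sample. Restricting the vote to the top-k most similar samples is one way to keep distant, atypical samples from influencing the prediction.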

Bibliographic details

Source
ACM Transactions on Asian and Low-Resource Language Information Processing | 2020, Issue 4 | pp. 55.1-55.19 | 19 pages
Authors
Sarwar Raheem; Rutherford Attapol T.; Hassan Saeed-Ul; Rakthanmanon Thanawin; Nutanong Sarana;
Author affiliations

Vidyasirimedhi Inst Sci & Technol Sch Informat Sci & Technol Wangchan Valley 555 Moo 1 Payupnai Wangchan 21210 Rayong Thailand;

Chulalongkorn Univ Dept Linguist Fac Arts Phayathai Rd Bangkok Thailand;

Informat Technol Univ Dept Comp Sci 346-B Ferozepur Rd Lahore Punjab Pakistan;

Vidyasirimedhi Inst Sci & Technol Sch Informat Sci & Technol Wangchan Valley 555 Moo 1 Payupnai Wangchan 21210 Rayong Thailand|Kasetsart Univ Dept Comp Engn 50 Thanon Ngamwongwan Bangkok 10900 Thailand;

Vidyasirimedhi Inst Sci & Technol Sch Informat Sci & Technol Wangchan Valley 555 Moo 1 Payupnai Wangchan 21210 Rayong Thailand;

Original format: PDF
Language: English
Keywords
Author profiling; stylometry; text classification; forensic investigation; native language identification;
Date added to database: 2022-08-18 21:31:04

Similar literature

1. Metadiscourse markers in biological research articles and journal impact factor: Non-native writers vs. native writers [J]. Gholami Javad, Ilghami Roghayeh. Biochemistry and Molecular Biology Education. 2016, Issue 4.
2. Vowel identification in temporal-modulated noise for native and non-native listeners: Effect of language experience [J]. Guan Jingjing, Liu Chang, Tao Sha. The Journal of the Acoustical Society of America. 2015, Issue 3aPta1.
3. Less-Detailed Representation of Non-Native Language: Why Non-Native Speakers' Stories Seem More Vague [J]. Shiri Lev-Ari, Boaz Keysar. Discourse Processes. 2012, Issue 7.
4. Figurative Languages Found in Folktales Translated by Native and Non-Native Writers [C]. Ribut Surjowati. PRASASTI International Seminar on Linguistics. 2019.
5. Responding to non-native writers of English: The relationship between a teacher's written comments and improvement in second language writing [D]. Ryoo, Seong Mae. 2013.
6. Costs and Benefits of Native Language Similarity for Non-native Word Learning [O]. Viorica Marian, James Bartolotti, Aimee van den Berg. 2021.
7. The research article: a rhetorical and functional comparison of texts created by native and non-native English writers and native Spanish writers [O]. Sheldon Elena. Arts and Media Faculty of Arts Social Sciences UNSW. 2013.
