Published in: International Conference on Recent Advances in Natural Language Processing

Detecting Clitics Related Orthographic Errors in Turkish



Abstract

For the spelling correction task, vocabulary-based methods have been replaced with methods that take morphological and grammatical rules into account. However, such tools are fairly immature and, worse, nonexistent for many low-resource languages. Checking only whether a word is well-formed with respect to the morphological rules of a language may produce false negatives, because numerous homophonic words make the check ambiguous. In this work, we propose an approach to detect and correct "de/da" clitic errors in Turkish text. Our model is a neural sequence tagger trained on a synthetically constructed dataset consisting of positive and negative samples. We report the model's performance on this dataset for different word embedding configurations. The model achieved an F1 score of 86.67% on the synthetically constructed dataset. We also evaluated the model on a manually curated dataset of challenging samples, where it outperformed other spelling correctors, reaching 71% accuracy compared with 34% for the second best (Google Docs).
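The abstract does not say how the synthetic positive and negative samples were built, so the following is only a minimal sketch of one plausible scheme, not the authors' procedure. It assumes a negative sample is made by attaching a correctly separated "de/da" clitic to the preceding word, and that each token receives a tag for a sequence tagger. The function name `corrupt`, the tag names `OK` and `ERR`, and the probability parameter `p` are all hypothetical.

```python
import random

# Standalone forms of the conjunction clitic ("also", "too"); in correct
# Turkish spelling these are written as separate words.
CLITICS = {"de", "da"}


def corrupt(tokens, rng, p=0.5):
    """Build one tagged sample from a correctly spelled token list.

    With probability p, each standalone clitic is glued onto the previous
    token, which is then tagged "ERR" (hypothetical tag). All other tokens
    are tagged "OK". With p=0 the sentence is returned as a positive sample.
    """
    out_tokens, labels = [], []
    for tok in tokens:
        can_merge = (
            tok.lower() in CLITICS
            and bool(out_tokens)
            and labels[-1] == "OK"
        )
        if can_merge and rng.random() < p:
            # Inject the orthographic error: "Ben de" -> "Bende".
            out_tokens[-1] = out_tokens[-1] + tok.lower()
            labels[-1] = "ERR"
        else:
            out_tokens.append(tok)
            labels.append("OK")
    return out_tokens, labels


if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "Ben de geldim".split()  # "I came too"
    print(corrupt(sentence, rng, p=1.0))
    print(corrupt(sentence, rng, p=0.0))
```

The corrupted form "Bende" is itself a well-formed Turkish word (the locative of "ben"), which illustrates why a purely morphological well-formedness check can miss this kind of error and why sentence context is needed.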
