Conference: International Conference on Computational Linguistics

Efficient Discrimination Between Closely Related Languages


Abstract

In this paper, we revisit the problem of language identification with a focus on properly discriminating between closely related languages. Strong similarities between certain languages make it very hard to classify them correctly using the standard methods proposed in the literature. Dedicated models that focus on specific discrimination tasks help to improve the accuracy of general-purpose language identification tools. We propose and compare two kinds of methods: simple document classification techniques trained on parallel corpora of closely related languages, and methods that emphasize discriminating features in the form of blacklisted words. Our experiments demonstrate that these techniques are highly accurate for the difficult task of discriminating between Bosnian, Croatian and Serbian. The best setup yields an absolute improvement of over 9% in accuracy over the best-performing baseline, which uses a state-of-the-art language identification tool.
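The abstract does not spell out how the blacklists are built or applied, so the following is only a minimal sketch of one plausible reading: a word is blacklisted for a language if it occurs in the training text of the other languages but never in that language's own text, and a document is assigned to the language whose blacklist it hits least often. The function names `build_blacklists` and `classify`, the `min_count` parameter, and the tiny toy corpus are all assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter

def build_blacklists(corpus, min_count=1):
    """Build one blacklist per language (hypothetical helper).

    corpus maps a language code to a list of tokenized documents.
    A word lands on the blacklist of a language if it was seen at least
    min_count times in the other languages and never in that language.
    """
    counts = {
        lang: Counter(tok for doc in docs for tok in doc)
        for lang, docs in corpus.items()
    }
    blacklists = {}
    for lang, own in counts.items():
        others = Counter()
        for other_lang, other_counts in counts.items():
            if other_lang != lang:
                others.update(other_counts)
        blacklists[lang] = {
            word for word, n in others.items()
            if n >= min_count and word not in own
        }
    return blacklists

def classify(tokens, blacklists):
    """Return the language whose blacklist is hit least often.

    Ties are broken alphabetically by language code (an arbitrary choice).
    """
    hits = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in blacklists.items()
    }
    return min(sorted(hits), key=hits.get)

# Toy, hand-written training data (an assumption, not the paper's corpus).
corpus = {
    "hr": [["tko", "je", "jeo", "kruh"], ["ovaj", "tjedan"]],
    "sr": [["ko", "je", "jeo", "hleb"], ["ove", "nedelje"]],
    "bs": [["ko", "je", "jeo", "hljeb"], ["ove", "sedmice"]],
}
blacklists = build_blacklists(corpus)

print(classify(["tko", "voli", "kruh"], blacklists))
print(classify(["ko", "jede", "hleb"], blacklists))
```

In this toy setup, shared words such as "je" never appear on any blacklist, so only the words that actually separate the languages influence the decision. A realistic system would need much larger training corpora and a frequency threshold above one to avoid blacklisting words that are merely rare.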
