Nordic Conference of Computational Linguistics

Compiling and Filtering ParIce: An English-Icelandic Parallel Corpus

Abstract

We present ParIce, a new English-Icelandic parallel corpus. This is the first parallel corpus built for the purposes of language technology development and research for Icelandic, although some Icelandic texts can be found in various other multilingual parallel corpora. We map which Icelandic texts are available for these purposes, collect and filter aligned data, align other bilingual texts we acquired and describe the alignment and filtering processes. After filtering, our corpus includes 39 million Icelandic words in 3.5 million segment pairs. We estimate that our filtering process reduced the number of faulty segments in the corpus by more than 60% while only reducing the number of good alignments by approximately 9%.
