International Conference on Language Resources and Evaluation
The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction

Abstract

In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus, called ALLEGRA, contains press releases automatically gathered from the website of the cantonal administration of Grisons. The texts have been preprocessed and aligned with a state-of-the-art sentence aligner. The corpus is one of the first of its kind and can be of great interest, particularly for the creation of natural language processing resources and tools for Romansh. We illustrate the use of such a trilingual resource for the automatic induction of bilingual lexicons, which is a real challenge for under-represented languages. We induce a German-Romansh bilingual lexicon by phrase alignment and evaluate the resulting entries against a reference lexicon. We then show that using the third language of the corpus, Italian, as a pivot language can improve the precision of the induced lexicon without reducing the quality of the extracted pairs.
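
The abstract does not spell out how the Italian pivot is applied, so the following is only a minimal sketch of one plausible reading: a German-Romansh candidate pair is kept only if some Italian word is aligned to both sides. The function names (pivot_filter, precision) and all word pairs are hypothetical toy entries chosen for illustration; they are not taken from ALLEGRA or from the authors' implementation.

```python
def pivot_filter(de_rm, de_it, it_rm):
    """Keep German-Romansh candidates that are confirmed via an Italian pivot.

    Assumption: each argument is a set of (source, target) word pairs,
    e.g. as produced by some alignment step that is not shown here.
    """
    kept = set()
    for de, rm in de_rm:
        # Italian words that the German word was aligned to.
        pivots = {it for d, it in de_it if d == de}
        # Keep the pair if any of those Italian words is also aligned
        # to the Romansh word.
        if any((it, rm) in it_rm for it in pivots):
            kept.add((de, rm))
    return kept


def precision(candidates, reference):
    """Fraction of candidate pairs that appear in the reference lexicon."""
    if not candidates:
        return 0.0
    return len(candidates & reference) / len(candidates)


# Toy data (hypothetical): the third German-Romansh pair is a noisy candidate.
de_rm = {("Haus", "chasa"), ("Kanton", "chantun"), ("Haus", "chantun")}
de_it = {("Haus", "casa"), ("Kanton", "cantone")}
it_rm = {("casa", "chasa"), ("cantone", "chantun")}
reference = {("Haus", "chasa"), ("Kanton", "chantun")}

filtered = pivot_filter(de_rm, de_it, it_rm)
print(sorted(filtered))
print(precision(de_rm, reference), precision(filtered, reference))
```

In this toy setting the pivot step discards the noisy pair while both correct pairs survive, which is the kind of precision gain the abstract reports. Real alignments would be noisier, and the actual filtering criterion used in the paper may differ.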