Conference: 9th International Conference on Language Resources and Evaluation

Automatic language identity tagging on word and sentence-level in multilingual text sources: a case-study on Luxembourgish


Abstract

Luxembourgish, embedded in a multilingual context on the divide between Romance and Germanic cultures, remains one of Europe's under-described languages. Written production remains relatively low, and linguistic knowledge and resources, such as lexica and pronunciation dictionaries, are sparse. Speakers and writers frequently switch between Luxembourgish, German, and French, both from one sentence to the next and within a sentence. To build resources such as lexicons, especially pronunciation lexicons, or the language models needed for natural language processing tasks such as automatic speech recognition, the language used in text corpora needs to be identified. In this paper, we present the design of a manually annotated corpus of mixed-language sentences, as well as the tools used to select these sentences. This corpus of difficult sentences was used to test a word-based language identification system. The same system was then used to select textual data extracted from the web in order to build a lexicon and language models. These were used in an automatic speech recognition system for Luxembourgish, which obtains a 25% word error rate (WER) on the Quaero development data.
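
The abstract does not say how the word-based language identification system works internally. The sketch below is one simple way to illustrate tagging at the word level and at the sentence level, assuming plain lexicon lookup. The tiny word lists, the language codes `lb`/`de`/`fr`, and the function names `tag_words` and `tag_sentence` are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Hypothetical toy lexicons for illustration only; a real system would rely
# on much larger word lists or statistical models.
LEXICONS = {
    "lb": {"ech", "sinn", "an", "der", "stad", "mat", "menger", "famill", "et", "ass"},
    "de": {"ich", "bin", "in", "der", "stadt", "mit", "meiner", "familie", "und"},
    "fr": {"je", "suis", "en", "ville", "avec", "ma", "famille", "et", "merci", "beaucoup"},
}

def tag_words(sentence):
    """Tag each word with a language code, or 'unk' if it cannot be decided."""
    tokens = [t.strip(".,!?;:") for t in sentence.lower().split()]
    tokens = [t for t in tokens if t]
    # Candidate languages per token: every lexicon that contains the word.
    candidates = [sorted(l for l, lex in LEXICONS.items() if t in lex) for t in tokens]
    # Only unambiguous words vote for the sentence's dominant language.
    votes = Counter(c[0] for c in candidates if len(c) == 1)
    dominant = votes.most_common(1)[0][0] if votes else "unk"
    tags = []
    for tok, cand in zip(tokens, candidates):
        if len(cand) == 1:
            tags.append((tok, cand[0]))
        elif dominant in cand:
            # A word shared by several lexicons follows the dominant language.
            tags.append((tok, dominant))
        else:
            # Unknown word, or ambiguous with no support from the context.
            tags.append((tok, "unk"))
    return tags

def tag_sentence(tags):
    """Label a sentence with its single language, 'mixed', or 'unk'."""
    langs = {lang for _, lang in tags if lang != "unk"}
    if not langs:
        return "unk"
    return langs.pop() if len(langs) == 1 else "mixed"

if __name__ == "__main__":
    tags = tag_words("Ech sinn an der Stad, merci beaucoup!")
    print(tags)
    print(tag_sentence(tags))
```

In this sketch a sentence that mixes Luxembourgish and French words comes out as `mixed`, which is the kind of sentence the annotated corpus described above is meant to collect.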
