LREC-2012

Open-Source Boundary-Annotated Corpus for Arabic Speech and Language Processing

Abstract

A boundary-annotated and part-of-speech tagged corpus is a prerequisite for developing phrase break classifiers. Boundary annotations in English speech corpora are descriptive, delimiting intonation units perceived by the listener. We take a novel approach to phrase break prediction for Arabic, deriving our prosodic annotation scheme from Tajwīd (recitation) mark-up in the Qur'an, which we then interpret as additional text-based data for computational analysis. This mark-up is prescriptive and signifies a widely used recitation style that is one of seven original styles of transmission. Here we report on version 1.0 of our Boundary-Annotated Qur'an dataset of 77430 words and 8230 sentences, where each word is tagged with prosodic and syntactic information at two coarse-grained levels. In (Sawalha et al., 2012), we use the dataset in phrase break prediction experiments. This research is part of a larger-scale project to produce annotation schemes, language resources, algorithms, and applications for Classical and Modern Standard Arabic.
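To make the idea of word-level boundary annotation concrete, the following is a minimal sketch of how such a corpus could be represented and split into phrases. It is not the dataset's actual format: the `Token` class, the `split_at_breaks` function, the tag names ("break", "non-break", "NOUN", "VERB", "PARTICLE") and the placeholder words are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    word: str      # surface form (placeholder strings here, not corpus data)
    pos: str       # hypothetical coarse-grained syntactic tag
    boundary: str  # hypothetical prosodic tag: "break" or "non-break"


def split_at_breaks(tokens: List[Token]) -> List[List[str]]:
    """Group words into chunks, closing a chunk after each token tagged "break"."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for tok in tokens:
        current.append(tok.word)
        if tok.boundary == "break":
            chunks.append(current)
            current = []
    if current:
        # Keep trailing words that are not followed by a break.
        chunks.append(current)
    return chunks


# Illustrative sequence only; the words and tags are invented.
sample = [
    Token("w1", "NOUN", "non-break"),
    Token("w2", "VERB", "break"),
    Token("w3", "PARTICLE", "non-break"),
    Token("w4", "NOUN", "break"),
]

print(split_at_breaks(sample))
```

A phrase break classifier of the kind the abstract mentions would be trained to predict the `boundary` value of each token from features such as `pos`; the grouping function above only shows how gold annotations delimit phrases.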
