
Taiwanese Corpus Collection via Continuous Speech Recognition Tool

Abstract

Corpora, in their different forms for different purposes, have been the basis of modern natural language processing technology. Taiwanese (MinNan), like other members of the Sino-Tibetan language family, has been marginalized for many reasons. One consequence of this marginalization is that no standard written script exists, so collecting corpora for these languages has been extremely difficult. Even after (almost) arbitrarily selecting the hanlor written script (a mixture of hanzi and roman characters), we still face the problem that only a few people are capable of phonetically transcribing a given Taiwanese text. On the other hand, reading a Taiwanese text is easier because many commonly used hanzi exist. We record a person reading a Taiwanese text and use a continuous speech recognizer for Taiwanese to automatically transcribe the text, ending up with two kinds of corpora, one in text and one in speech. The accuracy of the automatic phonetic transcription is about 96.05
