Journal: IEEE Transactions on Speech and Audio Processing

Kanji-to-Hiragana conversion based on a length-constrained n-gram analysis



Abstract

A common problem in speech processing is the conversion of the written form of a language to a set of phonetic symbols representing the pronunciation. In this paper, we focus on an aspect of this problem specific to the Japanese language. Written Japanese consists of a mixture of three types of symbols: Kanji, Hiragana, and Katakana. We describe an algorithm for converting conventional Japanese orthography to a Hiragana-like symbol set that closely approximates the most common pronunciation of the text. The algorithm is based on two hypotheses: (1) the correct reading of a Kanji character can be determined by examining a small number of adjacent characters and (2) the number of such combinations required in a dictionary is manageable. The algorithm described here converts the input text by selecting the most probable sequence of orthographic units (n-grams) that can be concatenated to form the input text. In closed-set testing, the n-gram algorithm was shown to provide better performance than several public domain algorithms, achieving a sentence error rate of 3% on a wide range of text material. Though the focus of this paper is written Japanese, the pattern matching algorithm described here has applications to similar problems in other languages.
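
The core idea in the abstract, choosing the most probable sequence of dictionary n-grams that concatenate to the input, can be illustrated with a small dynamic-programming sketch. This is not the authors' implementation: the dictionary entries, the probabilities, the `MAX_LEN` limit, and the scoring rule (a product of per-unit probabilities) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

# Hypothetical toy dictionary: orthographic unit -> (Hiragana reading, probability).
# Entries and probabilities are illustrative only and are not taken from the paper.
NGRAMS = {
    "日本語": ("にほんご", 0.95),
    "日本": ("にほん", 0.9),
    "日": ("ひ", 0.4),
    "本": ("ほん", 0.5),
    "語": ("ご", 0.8),
    "を": ("を", 1.0),
    "話": ("はな", 0.7),
    "す": ("す", 1.0),
}
MAX_LEN = 3  # assumed upper bound on n-gram length (the "length constraint")


def convert(text, ngrams=NGRAMS, max_len=MAX_LEN):
    """Return the reading of the highest-scoring segmentation, or None."""
    n = len(text)
    # best[i] holds (log-probability, reading) for the best cover of text[:i].
    best = [(-math.inf, "")] * (n + 1)
    best[0] = (0.0, "")
    for end in range(1, n + 1):
        # Only consider units of at most max_len characters ending at `end`.
        for start in range(max(0, end - max_len), end):
            unit = text[start:end]
            prev_score, prev_reading = best[start]
            if unit not in ngrams or prev_score == -math.inf:
                continue
            reading, prob = ngrams[unit]
            score = prev_score + math.log(prob)
            if score > best[end][0]:
                best[end] = (score, prev_reading + reading)
    # No sequence of dictionary units covers the whole input.
    if best[n][0] == -math.inf:
        return None
    return best[n][1]


if __name__ == "__main__":
    print(convert("日本語を話す"))
```

In this sketch a longer unit wins whenever its probability exceeds the product of the probabilities of the shorter units covering the same span, which is one simple way to let context (adjacent characters) determine a Kanji reading. Input that no sequence of dictionary units can cover yields `None`.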
