Venue: International Conference on Speech and Computer

Investigating Word Segmentation Techniques for German Using Finite-State Transducers

Abstract

Word segmentation plays an important role in speech recognition as a text pre-processing step that helps reduce out-of-vocabulary items and lower language model perplexity. Segmentation is applied mainly to agglutinative languages, but other morphologically rich languages, such as German, can also benefit from it. Using a relatively small, manually collected broadcast corpus of 134k tokens, the current study investigates how Finite-State Transducers (FSTs) can be applied to word segmentation in German. FSTs incorporating word-formation rules reach high segmentation performance, with a precision of 0.97 and a recall of 0.93. FSTs incorporating n-gram models of manually segmented data reach even higher performance, with accuracy and recall both at 0.97. This result is remarkable because the bottom-up approach performs on par with the expert system without requiring explicit knowledge of morphological categories or word-formation rules.
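To make the precision and recall figures above concrete, the following is a minimal sketch of how such scores might be computed over segmentation boundaries. It assumes that boundaries are marked with a "+" separator and that scoring is done per boundary position; the abstract does not state the evaluation protocol actually used, and the function names and the three example words are hypothetical.

```python
def boundaries(segmented: str, sep: str = "+") -> set[int]:
    """Return character offsets (in the unsegmented word) where a boundary falls."""
    positions = set()
    offset = 0
    for ch in segmented:
        if ch == sep:
            positions.add(offset)  # boundary before the next character
        else:
            offset += 1
    return positions


def precision_recall(gold: list[str], predicted: list[str], sep: str = "+"):
    """Boundary-level precision and recall (an assumed scoring scheme)."""
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        gold_b, pred_b = boundaries(g, sep), boundaries(p, sep)
        tp += len(gold_b & pred_b)   # boundaries found correctly
        fp += len(pred_b - gold_b)   # boundaries proposed but not in gold
        fn += len(gold_b - pred_b)   # gold boundaries that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the paper's corpus.
gold = ["Haus+tür", "Sprach+erkennung", "Wort"]
pred = ["Haus+tür", "Spracherkennung", "Wo+rt"]
p, r = precision_recall(gold, pred)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case one boundary is found correctly, one is missed, and one is spurious, so both scores come out at one half. A system that under-segments would show the pattern reported for the rule-based FST: higher precision than recall.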