Annual Conference of the International Speech Communication Association (INTERSPEECH 2010)

Text Normalization based on Statistical Machine Translation and Internet User Support


Abstract

In this paper, we describe and compare systems for text normalization based on statistical machine translation (SMT) methods, which are constructed with the support of internet users. Internet users normalize text displayed in a web interface, thereby providing a parallel corpus of normalized and non-normalized text. With this corpus, SMT models are generated to translate non-normalized text into normalized text. Building traditional language-specific text normalization systems requires knowledge of linguistics as well as established computer skills to implement the text normalization rules. Our systems can be built without profound computer knowledge thanks to the simple, self-explanatory user interface and the automatic generation of the SMT models. Additionally, no in-house knowledge of the language to be normalized is required, owing to the multilingual expertise of the internet community. All techniques are applied to French texts crawled with our Rapid Language Adaptation Toolkit [1] and are compared through Levenshtein edit distance [2], BLEU score [3], and perplexity.
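As an illustration of the first evaluation measure named in the abstract, the following is a minimal sketch of Levenshtein edit distance computed over word tokens. The abstract does not say whether the distance is measured on words or characters, so the word-level choice here is an assumption, and the function name `edit_distance` and the French example sentences are hypothetical, not taken from the paper.

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences (here: lists of word tokens).

    Counts the minimum number of insertions, deletions, and substitutions
    needed to turn `hyp` into `ref`. Works on any sequences, including strings.
    """
    # prev[j] holds the distance between the first i-1 items of hyp
    # and the first j items of ref.
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # turning i items of hyp into an empty ref needs i deletions
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(
                prev[j] + 1,         # delete h
                curr[j - 1] + 1,     # insert r
                prev[j - 1] + cost,  # substitute h with r (or keep it)
            ))
        prev = curr
    return prev[-1]


# Hypothetical example: system output vs. a manually normalized reference.
system_output = "le 3 mai".split()
reference = "le trois mai".split()
print(edit_distance(system_output, reference))  # 1 (one substituted token)
```

A lower distance to the reference indicates that the system output is closer to the manually normalized text; dividing by the reference length would give a length-normalized error rate, if that is preferred.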
