Second Workshop on Advances in Text Input Methods

An Ensemble Model of Word-based and Character-based Models for Japanese and Chinese Input Method


Abstract

Japanese and Chinese have too many characters to enter directly on a standard keyboard, so input methods that enable users to enter these characters are required. Input methods based on statistical models have recently become popular because of their accuracy and ease of maintenance. Most of them adopt word-based models because the models are trained on word-segmented corpora. However, word-based models handle unknown words poorly: they cannot correctly convert words that do not appear in the corpora. To address this problem, we propose a character-based model that enables input methods to convert unknown words by exploiting character-aligned corpora generated automatically by a monotonic alignment tool. In addition, we propose an ensemble of the character-based and word-based models to achieve higher accuracy; the ensemble combines the two models by linear interpolation. All of these models are based on the joint source-channel model, which exploits rich context through higher-order joint n-grams. Experiments on Japanese and Chinese datasets show that the character-based model performs reasonably well and that the ensemble model outperforms the word-based baseline. As future work, the effectiveness of incorporating large amounts of raw data should be investigated.
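
The linear interpolation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say where in the joint n-gram models the interpolation is applied or how the weight is chosen, so everything below is an assumption for illustration only: the function names `interpolate` and `best_candidate`, the weight `lam`, and the toy probabilities are hypothetical and are not taken from the paper.

```python
def interpolate(p_word: float, p_char: float, lam: float) -> float:
    """Linearly interpolate two model probabilities.

    lam is the weight on the word-based model (an assumed parameter);
    the character-based model receives the remaining 1 - lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    return lam * p_word + (1.0 - lam) * p_char


def best_candidate(candidates: dict, lam: float = 0.5) -> str:
    """Return the conversion candidate with the highest interpolated score.

    candidates maps each candidate string to a pair
    (word_model_probability, char_model_probability).
    """
    return max(candidates, key=lambda c: interpolate(*candidates[c], lam))


# Toy, made-up numbers (not results from the paper).
# "unknown_word" is absent from the word-segmented corpus, so the
# word-based model assigns it probability 0; the character-based
# model can still score it.
toy = {
    "unknown_word": (0.0, 0.6),
    "known_word": (0.3, 0.1),
}

ensemble_choice = best_candidate(toy, lam=0.5)   # both models contribute
word_only_choice = best_candidate(toy, lam=1.0)  # word-based model alone
```

With these toy numbers, the word-based model alone can never select the unknown word, while the interpolated score lets the character-based model's evidence win. In practice the weight would presumably be tuned on held-out data, which is again an assumption rather than something the abstract states.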
