ACM Transactions on Asian Language Information Processing

Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation



Abstract

Among statistical approaches to Chinese word segmentation, the word-based n-gram (generative) model and the character-based tagging (discriminative) model are the two dominant approaches in the literature. The former gives excellent performance on in-vocabulary (IV) words; however, it handles out-of-vocabulary (OOV) words poorly. The latter, on the other hand, is more robust for OOV words but fails to deliver satisfactory performance on IV words. These two approaches behave differently because of the unit they use (word vs. character) and the model form they adopt (generative vs. discriminative). In general, character-based approaches are more robust than word-based ones, since the vocabulary of characters is a closed set; and discriminative models are more robust than generative ones, since they can flexibly incorporate all kinds of available information, such as future context. This article first proposes a character-based n-gram model to enhance the robustness of the generative approach. The proposed generative model is then integrated with the character-based discriminative model to take advantage of both approaches. Our experiments show that this integrated approach outperforms all existing approaches reported in the literature. Afterwards, a complete and detailed error analysis is conducted. Since a significant portion of the critical errors is related to numerical/foreign strings, character-type information is then incorporated into the model to further improve its performance. Finally, the proposed integrated approach is tested on cross-domain corpora, and a semi-supervised domain adaptation algorithm is proposed and shown to be effective in our experiments.
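The integration described above can be illustrated with a minimal sketch. The sketch below combines a character-based generative bigram score over (character, tag) pairs with a per-position discriminative tagging score via log-linear interpolation; the B/I tag scheme, the probability tables, and the interpolation weight `alpha` are all illustrative assumptions, not the paper's trained models or actual parameterization.

```python
import math
from itertools import product

# Toy character-based generative bigram model over (character, tag) pairs,
# where B = begins a word and I = inside a word. Values are made up for
# illustration, not trained probabilities.
BIGRAM = {
    (("<s>", "B"), ("我", "B")): 0.6,
    (("我", "B"), ("们", "I")): 0.7,
    (("我", "B"), ("们", "B")): 0.2,
}

def generative_score(chars, tags, smooth=1e-6):
    """Sum of log bigram probabilities over (character, tag) pairs."""
    score, prev = 0.0, ("<s>", "B")
    for c, t in zip(chars, tags):
        score += math.log(BIGRAM.get((prev, (c, t)), smooth))
        prev = (c, t)
    return score

# Toy discriminative per-position tag distributions, as a CRF/MaxEnt-style
# tagger conditioned on surrounding context might produce.
DISCRIM = [{"B": 0.9, "I": 0.1}, {"B": 0.2, "I": 0.8}]

def discriminative_score(tags):
    """Sum of log tag probabilities from the discriminative tagger."""
    return sum(math.log(DISCRIM[i][t]) for i, t in enumerate(tags))

def combined_score(chars, tags, alpha=0.5):
    """Log-linear interpolation of the generative and discriminative scores."""
    return alpha * generative_score(chars, tags) + (1 - alpha) * discriminative_score(tags)

# Brute-force decode over the 2-character toy input "我们" (a real decoder
# would use Viterbi search over the combined score).
chars = ["我", "们"]
best = max(product("BI", repeat=2), key=lambda tags: combined_score(chars, tags))
# best == ("B", "I"): the two characters are segmented as one word.
```

In this toy run the generative model strongly prefers keeping 我们 as a single word, and the discriminative scores agree, so the combined decode tags the string B-I (one two-character word). The weight `alpha` plays the same balancing role as the interpolation between the two models described in the abstract.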
