ACM Transactions on Asian Language Information Processing

Integrating Generative and Discriminative Character-Based Models for Chinese Word Segmentation



Abstract

Among statistical approaches to Chinese word segmentation, the word-based n-gram (generative) model and the character-based tagging (discriminative) model are the two dominant approaches in the literature. The former gives excellent performance on in-vocabulary (IV) words; however, it handles out-of-vocabulary (OOV) words poorly. The latter, on the other hand, is more robust for OOV words but fails to deliver satisfactory performance on IV words. These two approaches behave differently because of the unit they use (word vs. character) and the model form they adopt (generative vs. discriminative). In general, character-based approaches are more robust than word-based ones, since the vocabulary of characters is a closed set; and discriminative models are more robust than generative ones, since they can flexibly incorporate all kinds of available information, such as future context. This article first proposes a character-based n-gram model to enhance the robustness of the generative approach. The proposed generative model is then integrated with the character-based discriminative model to take advantage of both approaches. Our experiments show that this integrated approach outperforms all existing approaches reported in the literature. Afterwards, a complete and detailed error analysis is conducted. Since a significant portion of the critical errors is related to numerical/foreign strings, character-type information is then incorporated into the model to further improve its performance. Finally, the proposed integrated approach is tested on cross-domain corpora, and a semi-supervised domain adaptation algorithm is proposed and shown to be effective in our experiments.
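The integration described above can be illustrated with a minimal sketch. The sketch below combines a character-based generative bigram score over (character, tag) pairs with a per-position discriminative tagging score via log-linear interpolation; the B/I tag scheme, the probability tables, and the interpolation weight `alpha` are all illustrative assumptions, not the paper's trained models or actual parameterization.

```python
import math
from itertools import product

# Toy character-based generative bigram model over (character, tag) pairs,
# where B = begins a word and I = inside a word. Values are made up for
# illustration, not trained probabilities.
BIGRAM = {
    (("<s>", "B"), ("我", "B")): 0.6,
    (("我", "B"), ("们", "I")): 0.7,
    (("我", "B"), ("们", "B")): 0.2,
}

def generative_score(chars, tags, smooth=1e-6):
    """Sum of log bigram probabilities over (character, tag) pairs."""
    score, prev = 0.0, ("<s>", "B")
    for c, t in zip(chars, tags):
        score += math.log(BIGRAM.get((prev, (c, t)), smooth))
        prev = (c, t)
    return score

# Toy discriminative per-position tag distributions, as a CRF/MaxEnt-style
# tagger conditioned on surrounding context might produce.
DISCRIM = [{"B": 0.9, "I": 0.1}, {"B": 0.2, "I": 0.8}]

def discriminative_score(tags):
    """Sum of log tag probabilities from the discriminative tagger."""
    return sum(math.log(DISCRIM[i][t]) for i, t in enumerate(tags))

def combined_score(chars, tags, alpha=0.5):
    """Log-linear interpolation of the generative and discriminative scores."""
    return alpha * generative_score(chars, tags) + (1 - alpha) * discriminative_score(tags)

# Brute-force decode over the 2-character toy input "我们" (a real decoder
# would use Viterbi search over the combined score).
chars = ["我", "们"]
best = max(product("BI", repeat=2), key=lambda tags: combined_score(chars, tags))
# best == ("B", "I"): the two characters are segmented as one word.
```

In this toy run the generative model strongly prefers keeping 我们 as a single word, and the discriminative scores agree, so the combined decode tags the string B-I (one two-character word). The weight `alpha` plays the same balancing role as the interpolation between the two models described in the abstract.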
