Journal: IEICE Transactions on Information and Systems

Detecting New Words from Chinese Text Using Latent Semi-CRF Models

Abstract

New words and their parts of speech (POS) are a persistent problem in Chinese natural language processing. With the rapid development of the Internet and information technology, no system dictionary for Chinese natural language processing can be complete, because words outside the basic system dictionary are constantly being created. A latent semi-CRF model, which combines the strengths of the LDCRF (Latent-Dynamic Conditional Random Field) and the semi-CRF, is proposed to detect new words and their POS tags simultaneously, regardless of the type of new word, from Chinese text that has not been pre-segmented. Unlike in the original semi-CRF, an LDCRF generates the candidate entities used to train and test the latent semi-CRF, which speeds up training and reduces computational cost. The complexity of the latent semi-CRF can be further adjusted by tuning the number of hidden variables in the LDCRF and the number of candidate entities taken from its N-best outputs. A new-word-generating framework is proposed for model training and testing, under which the definitions and distributions of the new words conform to those found in real text. Specific features, called "Global Fragment Information", are adopted for new word detection and POS tagging in model training and testing. Experimental results show that the proposed method can detect even low-frequency new words together with their POS tags, and that the proposed model performs competitively with the state-of-the-art models presented.
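The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of decoding over a restricted candidate set: a first-order semi-Markov Viterbi search that considers only segments proposed by an upstream tagger, instead of every possible span. The `(start, end, label)` segment format, the `decode` function, and the `score` callback are all assumptions made for illustration. The `score` callback stands in for learned potentials, and the sketch does not model the latent variables of the paper's latent semi-CRF.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical candidate format: (start, end_exclusive, label), e.g. a word span and its POS tag.
Segment = Tuple[int, int, str]
START = "<s>"  # assumed pseudo-label preceding the first segment


def decode(
    n: int,
    candidates: List[Segment],
    score: Callable[[str, Segment], float],
) -> Tuple[List[Segment], float]:
    """Return the highest-scoring sequence of candidate segments covering [0, n).

    `score(prev_label, segment)` is an assumed stand-in for a learned
    segment-plus-transition potential.
    """
    # chart[pos][label] = (best score of a path whose last segment ends at pos
    #                      with this label, that last segment, previous label)
    chart: List[Dict[str, Tuple[float, Optional[Segment], Optional[str]]]] = [
        {} for _ in range(n + 1)
    ]
    chart[0][START] = (0.0, None, None)

    # Group the candidates by end position so each position is filled once.
    by_end: Dict[int, List[Segment]] = defaultdict(list)
    for seg in candidates:
        start, end, _ = seg
        if 0 <= start < end <= n:
            by_end[end].append(seg)

    for end in range(1, n + 1):
        for seg in by_end[end]:
            start, _, label = seg
            # chart[start] is already final because start < end.
            for prev_label, (prev_score, _, _) in chart[start].items():
                total = prev_score + score(prev_label, seg)
                if label not in chart[end] or total > chart[end][label][0]:
                    chart[end][label] = (total, seg, prev_label)

    if not chart[n]:
        raise ValueError("candidate segments do not cover the whole input")

    # Pick the best final label, then follow back-pointers to position 0.
    label = max(chart[n], key=lambda lab: chart[n][lab][0])
    best_total = chart[n][label][0]
    path: List[Segment] = []
    pos = n
    while pos > 0:
        _, seg, prev_label = chart[pos][label]
        path.append(seg)
        pos, label = seg[0], prev_label
    path.reverse()
    return path, best_total
```

Because only the supplied candidates are scored, the work grows with the number of candidates and labels rather than with all spans of the input, which is in keeping with the abstract's point that the candidate count from the N-best outputs controls model complexity.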