Journal: ACM Transactions on Asian Language Information Processing
A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

Abstract

This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study 16 Chinese punctuation marks as defined in the Chinese national standard of punctuation usage, and categorize them into three types according to their syntactic properties. We implement a three-tier maximum entropy (MaxEnt) model incorporating linguistically motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words indicating sentence-final punctuation marks as a specialized feature to construct a more precise model. Evaluated on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, and 79.50% for insertion and 73.32% for restoration using parser-based imperfect input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model that scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human insertion f-scores in the range of 81-87% and human restoration f-scores in the range of 71-82%. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks in fact result in acceptable choices, a fact obscured in the automatic string-matching-based evaluation scores.
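The abstract reports separate f-scores for insertion and restoration but does not define them. The following is a minimal sketch under a stated assumption: insertion counts a predicted mark as correct when it lands in a slot where the reference has any mark, while restoration also requires the mark itself to match. The function name `f_score`, the slot-index representation, and the toy data are hypothetical illustrations, not the authors' evaluation code.

```python
def f_score(reference, hypothesis, match_mark):
    """Balanced F-score over punctuation slots.

    reference / hypothesis: dict mapping a slot index (the position after
    word i) to the punctuation mark placed there. This representation is an
    assumption made for illustration.
    match_mark: if False, a hit only needs the right slot (assumed
    "insertion"); if True, the mark must also be identical (assumed
    "restoration").
    """
    if match_mark:
        hits = sum(1 for slot, mark in hypothesis.items()
                   if reference.get(slot) == mark)
    else:
        hits = sum(1 for slot in hypothesis if slot in reference)

    precision = hits / len(hypothesis) if hypothesis else 0.0
    recall = hits / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: the reference has a comma after word 2 and a full stop after word 5.
reference = {2: ",", 5: "。"}
# The system picks the wrong mark at slot 2 and adds a spurious comma at slot 7.
hypothesis = {2: "、", 5: "。", 7: ","}

insertion_f = f_score(reference, hypothesis, match_mark=False)
restoration_f = f_score(reference, hypothesis, match_mark=True)
print(f"insertion F = {insertion_f:.2f}, restoration F = {restoration_f:.2f}")
```

Under this reading, restoration can never score higher than insertion on the same output, which is consistent with the ordering of the figures reported in the abstract. Exact string matching of this kind also penalizes a different but acceptable mark, the effect the abstract's manual error analysis points out.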