A Linguistically Inspired Statistical Model for Chinese Punctuation Generation

YUQING GUO; HAIFENG WANG; JOSEF VAN GENABITH

Abstract

This article investigates a relatively underdeveloped subject in natural language processing: the generation of punctuation marks. From a theoretical perspective, we study 16 Chinese punctuation marks as defined in the Chinese national standard of punctuation usage, and categorize these punctuation marks into three different types according to their syntactic properties. We implement a three-tier maximum entropy model incorporating linguistically motivated features for generating the commonly used Chinese punctuation marks in unpunctuated sentences output by a surface realizer. Furthermore, we present a method to automatically extract cue words indicating sentence-final punctuation marks as a specialized feature to construct a more precise model. Evaluated on the Penn Chinese Treebank data, the MaxEnt model achieves an f-score of 79.83% for punctuation insertion and 74.61% for punctuation restoration using gold data input, and 79.50% for insertion and 73.32% for restoration using parser-based imperfect input. The experiments show that the MaxEnt model significantly outperforms a baseline 5-gram language model that scores 54.99% for punctuation insertion and 52.01% for restoration. We show that our results are not far from human performance on the same task, with human insertion f-scores in the range of 81-87% and human restoration f-scores in the range of 71-82%. Finally, a manual error analysis of the generation output shows that close to 40% of the mismatched punctuation marks do in fact result in acceptable choices, a fact obscured in the automatic string-matching based evaluation scores.
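The abstract reports separate f-scores for insertion and restoration without defining them on this page. The sketch below is one plausible reading, not the authors' evaluation script: it assumes insertion counts a predicted mark as correct when any mark exists at that position in the reference, while restoration also requires the mark itself to match. The function name `f_score`, the position-to-mark dictionaries, and the toy sentence are all hypothetical.

```python
from typing import Dict


def f_score(gold: Dict[int, str], pred: Dict[int, str], match_type: bool) -> float:
    """Harmonic mean of precision and recall over punctuation decisions.

    gold / pred map a token-gap index to the punctuation mark placed there.
    match_type=False: position-only matching (assumed "insertion" metric).
    match_type=True:  position and mark must match (assumed "restoration" metric).
    """
    if match_type:
        correct = sum(1 for pos, mark in pred.items() if gold.get(pos) == mark)
    else:
        correct = sum(1 for pos in pred if pos in gold)
    if not gold or not pred or correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy example: the reference has a comma after token 3 and a period after token 7.
gold = {3: "，", 7: "。"}
# The system places the wrong mark at 3, the right mark at 7, and a spurious mark at 5.
pred = {3: "、", 7: "。", 5: "，"}

insertion_f = f_score(gold, pred, match_type=False)
restoration_f = f_score(gold, pred, match_type=True)
print(f"insertion f-score:   {insertion_f:.2f}")
print(f"restoration f-score: {restoration_f:.2f}")
```

Under this reading, restoration can never score higher than insertion on the same output, which is consistent with the ordering of the figures reported in the abstract.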

Bibliographic details

Source
ACM Transactions on Asian Language Information Processing | 2010, Issue 2 | pp. 6.1-6.27 | 27 pages
Authors
YUQING GUO; HAIFENG WANG; JOSEF VAN GENABITH
Author affiliations

Toshiba (China) Research and Development Center, 5/F., Tower W2, Oriental Plaza, Dongcheng District, Beijing, 100738, China;

Baidu Campus, No. 10, Shangdi 10th Street, Haidian District, Beijing 100085, China;

NCLT/CNGL, School of Computing, Dublin City University, Glasnevin, Dublin 9, Ireland

Original format: PDF
Language of text: English
Keywords
Chinese punctuation marks; maximum entropy model; sentence realization

Date added to database: 2022-08-17 13:41:57

Similar literature

1. Information theoretic models in statistical linguistics--Part II: Word frequencies and hierarchical structure in language--statistical tests [J]. Balasubrahmanyan V. K., Naranan S. Current science. 1992, No. 06
2. Cognitive Linguistics-Inspired Empirical Study of Chinese EFL Teaching [J]. Youmei Gao. Creative Education. 2011, No. 4
3. Information theoretic models in statistical linguistics--Part I: A model for word frequencies [J]. Balasubrahmanyan V. K., Naranan S. Current science. 1992, No. 05
4. Improvements on punctuation generation inspired linguistic features for Mandarin prosody generation [C]. Chen-Yu Chiang, Yu-Ping Hung, Guan-Ting Liou, International Symposium on Chinese Spoken Language Processing. 2016
5. Diverse Linguistic Resources and Multidimensional Identities: A Study of the Linguistic and Identity Repertoires of Second Generation Chinese Americans in New York City. [D]. Wong, Amy Wing-mei. 2015
6. In silico biologically-inspired modelling of genomic variation generation in surface proteins of Trypanosoma cruzi [O]. Francisco J Azuaje, Jose L Ramirez, Jose F Da Silveira. 2007
7. Punctuation Generation Inspired Linguistic Features for Mandarin Prosody Generation [O]. Chen-Yu Chiang, Yu-Ping Hung, Han-Yun Yeh, 2018
