Word Segmentation for Chinese Novels

Abstract

Word segmentation is a necessary first step for automatic syntactic analysis of Chinese text. Chinese segmentation is highly accurate on news data, but the accuracies drop significantly on other domains, such as science and literature. For scientific domains, a significant portion of out-of-vocabulary words are domain-specific terms, and therefore lexicons can be used to improve segmentation significantly. For the literature domain, however, there is not a fixed set of domain terms. For example, each novel can contain a specific set of person, organization and location names. We investigate a method for automatically mining common noun entities for each novel using information extraction techniques, and use the resulting entities to improve a state-of-the-art segmentation model for the novel. In particular, we design a novel double-propagation algorithm that mines noun entities together with common contextual patterns, and use them as plug-in features to a model trained on the source domain. An advantage of our method is that no retraining for the segmentation model is needed for each novel, and hence it can be applied efficiently given the huge number of novels on the web. Results on five different novels show significantly improved accuracies, in particular for OOV words.
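The abstract does not give the details of the double-propagation algorithm, so the following is only a minimal sketch of the general idea: known entities reveal frequent contexts, and those contexts in turn reveal new entities. It assumes that sentences have already been tokenized by some baseline segmenter, that a context pattern is simply the pair of neighbouring tokens, and that a frequency threshold is enough to filter noise. The function name, the thresholds, and the toy sentences are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def double_propagation(sentences, seed_entities,
                       min_pattern_count=2, min_entity_count=2, max_rounds=5):
    """Toy bootstrapping loop (illustrative assumption, not the paper's algorithm).

    sentences: list of token lists, assumed to come from a baseline segmenter.
    seed_entities: a small set of known entity tokens.
    A pattern is the (previous token, next token) pair around an entity.
    """
    entities = set(seed_entities)
    patterns = set()

    def contexts(tokens):
        # Yield (token, (left neighbour, right neighbour)) with sentence padding.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - 1):
            yield padded[i], (padded[i - 1], padded[i + 1])

    for _ in range(max_rounds):
        # Entities -> patterns: count the contexts in which known entities occur.
        pattern_counts = Counter(
            ctx for tokens in sentences
            for tok, ctx in contexts(tokens) if tok in entities
        )
        new_patterns = {p for p, c in pattern_counts.items()
                        if c >= min_pattern_count}

        # Patterns -> entities: count tokens that occur inside known patterns.
        known_patterns = patterns | new_patterns
        entity_counts = Counter(
            tok for tokens in sentences
            for tok, ctx in contexts(tokens) if ctx in known_patterns
        )
        new_entities = {e for e, c in entity_counts.items()
                        if c >= min_entity_count}

        # Stop once a round adds nothing new.
        if new_patterns <= patterns and new_entities <= entities:
            break
        patterns |= new_patterns
        entities |= new_entities

    return entities, patterns


# Made-up toy sentences, already tokenized; one seed name.
toy_sentences = [
    ["郭靖", "说道", "好"],
    ["郭靖", "说道", "走"],
    ["黄蓉", "说道", "好"],
    ["黄蓉", "说道", "走"],
    ["他", "笑", "了"],
]
found_entities, found_patterns = double_propagation(toy_sentences, {"郭靖"})
```

In this toy run the seed name yields the sentence-initial "说道" context, which then picks up the second name. According to the abstract, the mined entities and patterns are used as plug-in features for a segmenter trained on the source domain, which is why no per-novel retraining is needed; that step is not shown here.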

Bibliographic details

Source: AAAI Conference on Artificial Intelligence, 2015, 7 pages
Authors: Likun Qiu; Yue Zhang
Original format: PDF
Chinese Library Classification: TP18-53
