Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information


Abstract

In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach alleviates the tremendous effort required to prepare and maintain manually segmented Chinese text for training purposes and to maintain ever-expanding lexicons by hand. Previously, mutual information was used to achieve automated segmentation into 2-character words. NGMI extends that technique to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
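The abstract does not give the NGMI formula, so the following is only a minimal sketch of the earlier idea it mentions: using mutual information between adjacent characters to decide where a word boundary falls. The function names (`train_counts`, `pmi`, `segment`), the pointwise form of mutual information, and the fixed `threshold` are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def train_counts(corpus):
    """Count single characters and adjacent character pairs in a list of strings."""
    unigrams, bigrams = Counter(), Counter()
    for text in corpus:
        unigrams.update(text)
        bigrams.update(text[i:i + 2] for i in range(len(text) - 1))
    return unigrams, bigrams


def pmi(a, b, unigrams, bigrams):
    """Pointwise mutual information log(p(ab) / (p(a) * p(b))).

    Returns -inf for a pair never seen in the corpus (an assumed convention).
    """
    pair = bigrams[a + b]
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_ab = pair / n_bi
    return math.log(p_ab / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))


def segment(text, unigrams, bigrams, threshold=0.0):
    """Cut between adjacent characters whose PMI falls below the threshold."""
    if not text:
        return []
    words, current = [], text[0]
    for prev, ch in zip(text, text[1:]):
        if pmi(prev, ch, unigrams, bigrams) < threshold:
            words.append(current)  # weak association: close the current word
            current = ch
        else:
            current += ch          # strong association: extend the current word
    words.append(current)
    return words


# Toy statistics; a real run would draw counts from a large corpus.
unigrams, bigrams = train_counts(["中文", "中文", "分词", "分词"])
print(segment("中文分词", unigrams, bigrams))
```

Because this sketch scores only adjacent pairs, it is limited to the pairwise statistics the abstract attributes to earlier work; handling longer n-character words is the part NGMI is said to add, and it is not reproduced here.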

Bibliographic details

Authors: Tang Ling-Xiang; Geva Shlomo; Xu Yue; Trotman Andrew
Year: 2009
Original format: PDF

Similar literature

1. Domain-specific Chinese word segmentation using suffix tree and mutual information [J]. Daniel Zeng, Donghua Wei, Michael Chau. Information Systems Frontiers. 2011, No. 1

2. Knowledge Expansion Support by Related Search Keyword Generation Based on Wikipedia Category and Pointwise Mutual Information [J]. Saori Kawauchi, Tetsuya Toyota, Hajime Nobuhara. Journal of Advanced Computational Intelligence and Intelligent Informatics. 2012, No. 2a90

3. Automatic Extraction Of New Words Based On Google News Corpora For Supporting Lexicon-based Chinese Word Segmentation Systems [J]. Chin-Ming Hong, Chih-Ming Chen, Chao-Yang Chiu. Expert Systems with Applications. 2009, No. 2p2

4. Joint n-gram Chinese language modeling with an application to Chinese word segmentation [C]. He Xin, Ou Zhijian, Sun Jiasong. 2012 International Conference on Audio, Language and Image Processing. 2012

5. Experimental comparison of discriminative learning approaches for Chinese word segmentation [D]. Song, Dong. 2008

6. The Trade-Off Between Format Familiarity and Word-Segmentation Facilitation in Chinese Reading [O]. Mingjing Chen, Yongsheng Wang, Bingjie Zhao. 2021

7. The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information [O]. Pengyu Lu, Lijun Jin, Bin Jiang. 2012
