
Word Segmentation for Chinese Wikipedia Using N-Gram Mutual Information


Abstract

In this paper, we propose an unsupervised segmentation approach, named "n-gram mutual information" (NGMI), which segments Chinese documents into n-character words or phrases using language statistics drawn from the Chinese Wikipedia corpus. The approach avoids the considerable effort otherwise required to prepare and maintain manually segmented Chinese text for training, and to manually maintain ever-expanding lexicons. Previously, mutual information was used to achieve automated segmentation into 2-character words; NGMI extends this approach to handle longer n-character words. Experiments with heterogeneous documents from the Chinese Wikipedia collection show good results.
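The abstract builds on the classic mutual-information criterion for 2-character segmentation: adjacent characters that co-occur far more often than chance are kept together, and a boundary is placed where the association is weak. A minimal Python sketch of that baseline criterion follows; the corpus, threshold, and function names are illustrative assumptions for exposition, not the paper's NGMI formulation.

```python
import math
from collections import Counter

def train_stats(corpus):
    """Count character unigrams and adjacent-character bigrams in a corpus."""
    uni, bi = Counter(), Counter()
    for text in corpus:
        uni.update(text)
        bi.update(text[i:i + 2] for i in range(len(text) - 1))
    return uni, bi

def pmi(pair, uni, bi, n_uni, n_bi):
    """Pointwise mutual information log(p(xy) / (p(x) p(y))) of a character pair."""
    p_xy = bi[pair] / n_bi
    if p_xy == 0:
        return float("-inf")  # never observed adjacent: certain boundary
    p_x = uni[pair[0]] / n_uni
    p_y = uni[pair[1]] / n_uni
    return math.log(p_xy / (p_x * p_y))

def segment(text, uni, bi, threshold=0.0):
    """Place a word boundary wherever adjacent characters score below threshold."""
    n_uni, n_bi = sum(uni.values()), sum(bi.values())
    words, current = [], text[0]
    for i in range(len(text) - 1):
        if pmi(text[i:i + 2], uni, bi, n_uni, n_bi) >= threshold:
            current += text[i + 1]   # strong association: extend current word
        else:
            words.append(current)    # weak association: start a new word
            current = text[i + 1]
    words.append(current)
    return words
```

The 2-character criterion only ever scores one adjacent pair at a time; the NGMI extension described in the abstract generalizes this scoring to candidate words longer than two characters.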

