A Chinese Word Segmentation Based on Machine Learning

Abstract

Unlike English, Chinese text has no delimiters between words. Segmenting Chinese text into words is the first step in nearly every kind of Chinese information processing, so Chinese word segmentation is a fundamental and difficult problem in the field. Traditional word segmentation systems require a manually built dictionary, and unknown (out-of-dictionary) words must be added to it by hand. This paper proposes a new Chinese word segmentation model that uses machine learning to build a dictionary automatically and then gradually update and refine it. The four modules of the machine-learning-based segmentation system are introduced in detail, and the algorithms in some modules are improved to raise system performance. Tests on a closed corpus and an open corpus show that the method reduces the workload of building and maintaining the dictionary and, furthermore, resolves the problems of ambiguity processing and unknown-word recognition.
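To make the dictionary dependence concrete, the sketch below shows forward maximum matching, a common dictionary-based baseline. It is not the model proposed in the paper, whose algorithms the abstract does not specify. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation (illustrative baseline only).

    At each position, take the longest substring (up to max_len characters)
    found in the dictionary; if nothing matches, emit a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary; "基于" is deliberately missing.
dictionary = {"中文", "分词", "机器", "学习", "机器学习"}
sentence = "基于机器学习的中文分词"

before = forward_max_match(sentence, dictionary)
print(before)  # "基于" is unknown, so it is split into single characters

# Adding the unknown word to the dictionary changes the segmentation.
dictionary.add("基于")
after = forward_max_match(sentence, dictionary)
print(after)
```

With the toy dictionary, the unknown word "基于" is broken into two single characters until it is added to the dictionary. That manual step is the maintenance burden the abstract says the proposed model is meant to reduce.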
