Journal: 中南民族大学学报(自然科学版) (Journal of South-Central University for Nationalities, Natural Science Edition)

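Research on a Zhuang Word Segmentation Algorithm Based on an Improved Mutual Information Algorithm and the t-Test Difference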

     

Abstract

Traditional Zhuang word segmentation treats the spaces between words as separators. In most cases, however, this breaks up semantic words, that is, combinations of several associated words that together express complete and independent semantic information. Building on earlier work that uses mutual information (MI) to measure the degree of association between adjacent words, this paper applies the improved mutual information algorithm MIk and the t-test difference to Zhuang text segmentation for the first time. Combining their respective strengths in evaluating the static and dynamic binding ability of adjacent words, it proposes TD-MIk, a hybrid algorithm that combines MIk with the t-test difference, and compares the segmentation results of MIk, the t-test difference, and TD-MIk. The experiments use a text collection from the Zhuang edition of People's Daily Online (人民网壮文版) as the training and test corpus. The results show that all three methods extract semantic words from text fairly accurately and effectively, and that the TD-MIk hybrid algorithm achieves the highest segmentation accuracy.
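The abstract does not give formulas, so the following Python sketch only illustrates the general idea of scoring adjacent space-delimited words and merging the strongly bound ones. It assumes commonly cited textbook forms: MI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), and a t-test difference Δt(x:y) = t_{v,y}(x) − t_{x,w}(y) for a word sequence v x y w, with variances approximated from counts. The function names, the weight `lam`, the `threshold`, and the linear combination in `segment` are hypothetical choices for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def count_ngrams(sentences):
    """Count words and adjacent word pairs in space-delimited sentences."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = s.split()
        uni.update(words)
        bi.update(zip(words, words[1:]))
    return uni, bi


def mi_k(x, y, uni, bi, k=2):
    """Assumed form: log2(p(x,y)^k / (p(x) p(y))); -inf for unseen pairs."""
    n = sum(uni.values())
    if bi[(x, y)] == 0:
        return float("-inf")
    pxy = bi[(x, y)] / n
    return math.log2(pxy ** k / ((uni[x] / n) * (uni[y] / n)))


def _cond(x, y, uni, bi):
    """Estimate p(y | x) from counts; 0 if x is unseen or None."""
    return bi[(x, y)] / uni[x] if uni[x] else 0.0


def _var(x, y, uni, bi):
    """Assumed variance approximation: count(x, y) / count(x)^2."""
    return bi[(x, y)] / uni[x] ** 2 if uni[x] else 0.0


def t_score(x, y, z, uni, bi):
    """t_{x,z}(y): > 0 if y leans toward successor z, < 0 toward predecessor x."""
    num = _cond(y, z, uni, bi) - _cond(x, y, uni, bi)
    den = math.sqrt(_var(y, z, uni, bi) + _var(x, y, uni, bi))
    return num / den if den else 0.0


def t_diff(v, x, y, w, uni, bi):
    """Delta-t(x:y) for the context v x y w; larger means x and y bind more."""
    return t_score(v, x, y, uni, bi) - t_score(x, y, w, uni, bi)


def segment(sentence, uni, bi, k=2, lam=0.5, threshold=0.0):
    """Merge adjacent words whose combined score exceeds a threshold.

    The weighted sum below is a hypothetical stand-in for the hybrid rule.
    """
    words = sentence.split()
    if not words:
        return []
    units = [[words[0]]]
    for i in range(1, len(words)):
        x, y = words[i - 1], words[i]
        v = words[i - 2] if i >= 2 else None          # left context, if any
        w = words[i + 1] if i + 1 < len(words) else None  # right context
        score = lam * mi_k(x, y, uni, bi, k) + (1 - lam) * t_diff(v, x, y, w, uni, bi)
        if score > threshold:
            units[-1].append(y)   # keep y in the current semantic word
        else:
            units.append([y])     # start a new unit
    return [" ".join(u) for u in units]


# Toy placeholder tokens, not real Zhuang text.
uni, bi = count_ngrams(["a b", "a b", "c d"])
print(segment("a b c d", uni, bi))  # ['a b', 'c d']
```

In this toy run the pairs (a, b) and (c, d) always co-occur and are merged, while the unseen pair (b, c) scores negative infinity and is split. In practice the two scores sit on different scales, so some normalization and tuned thresholds would presumably be needed before combining them.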
