Published in: 《计算机工程与设计》 (Computer Engineering and Design)

A New-Word Discovery Method for Internet Language Based on Microblog Corpora

         

Abstract

To identify new Chinese words in microblog corpora effectively, this paper proposes a new-word discovery method that combines a word-level mutual information model with external statistics, tailored to the textual characteristics of microblog text. A candidate word set is built with a mutual information statistical model that starts from the minimal collocation unit inside a candidate and extends it by testing the adjacent element on its right. Candidates are then screened with a low-frequency threshold, chosen according to the statistical properties and features of the corpus, and filtered using external statistics. This statistical approach removes the limitation that a mutual information model, when used for new-word discovery, can only score two component elements, and it avoids the N-gram overlap problem that degrades the accuracy of new-word discovery. The filtering step works well on microblog corpora, which contain a large number of short phrases and sentences. Worked examples and comparative experiments verify the effectiveness of the method.
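The pipeline described in the abstract (mutual information on a minimal unit, extension to the right, low-frequency screening, external-statistic filtering) can be sketched as follows. The abstract gives no formulas or threshold values, so everything concrete here is an assumption: the score is ordinary pointwise mutual information, the "external statistic" is approximated by the entropy of a candidate's left and right neighbours, overlap is avoided by resuming the scan after each accepted candidate, and all function names, parameter names, default thresholds, and the toy corpus are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict


def count_ngrams(sentences, max_len):
    """Count every substring of length 1..max_len in the corpus."""
    counts = Counter()
    for s in sentences:
        for n in range(1, max_len + 1):
            for i in range(len(s) - n + 1):
                counts[s[i:i + n]] += 1
    return counts


def pmi(left, right, counts, total):
    """Pointwise mutual information: log2(p(left+right) / (p(left) * p(right)))."""
    joint = counts[left + right]
    if joint == 0:
        return float("-inf")
    return math.log2(joint * total / (counts[left] * counts[right]))


def neighbor_entropy(neighbors):
    """Shannon entropy (bits) of the characters seen next to a candidate."""
    n = sum(neighbors.values())
    return -sum(c / n * math.log2(c / n) for c in neighbors.values())


def discover(sentences, max_len=4, pmi_min=1.0, min_freq=2, entropy_min=0.5):
    """Return candidates that pass the PMI, frequency, and neighbour filters."""
    counts = count_ngrams(sentences, max_len)
    total = sum(len(s) for s in sentences)
    freq = Counter()
    left_nb, right_nb = defaultdict(Counter), defaultdict(Counter)
    for s in sentences:
        i = 0
        while i < len(s) - 1:
            # Minimal unit: two adjacent characters with high enough PMI.
            if pmi(s[i], s[i + 1], counts, total) < pmi_min:
                i += 1
                continue
            # Extend to the right while the next character still binds strongly.
            j = i + 2
            while (j < len(s) and j - i < max_len
                   and pmi(s[i:j], s[j], counts, total) >= pmi_min):
                j += 1
            word = s[i:j]
            freq[word] += 1
            # Sentence boundaries count as neighbours ("^" and "$").
            left_nb[word][s[i - 1] if i > 0 else "^"] += 1
            right_nb[word][s[j] if j < len(s) else "$"] += 1
            i = j  # assumption: resume after the candidate to avoid overlaps
    return sorted(
        w for w, f in freq.items()
        if f >= min_freq  # low-frequency screening
        and min(neighbor_entropy(left_nb[w]),
                neighbor_entropy(right_nb[w])) >= entropy_min
    )


# Synthetic toy corpus: "给力" recurs in varied contexts, the rest is filler.
sentences = ["很给力啊", "真给力了", "啊给力很", "了给力真",
             "很真啊了", "啊很了真", "真了很啊", "了啊真很"]
print(discover(sentences))  # prints ['给力']
```

In this toy corpus the pair 给 + 力 always co-occurs, so its PMI is high, while every surrounding character varies, which keeps the neighbour entropy high and stops the right extension. On real microblog text the thresholds would have to be tuned, and the paper's actual external statistics may differ from the entropy used here.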
