Journal: 《电脑与电信》

A Quantitative Study of Corpus Scale for Chinese Word Segmentation Based on Conditional Random Fields

Abstract

Conditional random fields (CRFs) have been widely used for Chinese word segmentation in recent years. When a CRF model is trained, the scale of the training corpus directly affects the stability of the model and the accuracy of segmentation, yet there is no guiding conclusion on how to choose that scale. To address this problem, the study takes a set of evaluation corpora of different scales from Bakeoff2005 and Bakeoff2006 and uses the CRF++ 0.53 toolkit to perform segmentation by tagging each character in the sequence with its position within a word. It quantitatively analyzes how the scale of the training corpus influences segmentation performance and derives a quantitative conclusion for choosing the training-corpus scale in CRF-based Chinese word segmentation.
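The setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the four-tag B/M/E/S scheme, the helper names, and the subset fractions below are all assumptions for illustration, since the abstract does not state which tag set or corpus sizes were used. The sketch converts pre-segmented sentences into the one-token-per-line, blank-line-separated layout that CRF++ reads for training, and builds nested training subsets of increasing size.

```python
from typing import Dict, List, Sequence, Tuple


def tag_sentence(words: Sequence[str]) -> List[Tuple[str, str]]:
    """Label every character with its position in its word.

    Assumed tag set (not stated in the abstract):
    B = begin, M = middle, E = end, S = single-character word.
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:  # skip empty tokens left by stray whitespace
            continue
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def to_crfpp_block(words: Sequence[str]) -> str:
    """Render one sentence: one 'char<TAB>tag' line per character,
    followed by a blank line that marks the sentence boundary."""
    lines = [f"{ch}\t{tag}" for ch, tag in tag_sentence(words)]
    return "\n".join(lines) + "\n\n"


def corpus_prefixes(
    sentences: Sequence[Sequence[str]], fractions: Sequence[float]
) -> Dict[float, List[Sequence[str]]]:
    """Build nested training subsets, each a prefix of the full corpus.

    The fractions are hypothetical; the study's actual scales are not
    given in the abstract.
    """
    total = len(sentences)
    return {f: list(sentences[: max(1, int(total * f))]) for f in fractions}


if __name__ == "__main__":
    corpus = [["我", "喜欢", "自然语言"], ["今天", "天气", "好"]]
    for fraction, subset in corpus_prefixes(corpus, [0.5, 1.0]).items():
        text = "".join(to_crfpp_block(sentence) for sentence in subset)
        print(f"fraction={fraction}: {len(subset)} sentence(s)")
        print(text, end="")
```

Each subset could then be written to a file and passed to CRF++ for training, so that segmentation performance can be compared across corpus scales on the same held-out test set.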
