Chinese word segmentation methods based on conditional random fields (CRFs) have been widely used in recent years. When a CRF model is trained, the scale of the training corpus directly influences model stability and segmentation precision, yet there is no instructive conclusion on how to choose that scale. To address this problem, a set of evaluation corpora of different scales is selected from Bakeoff2005 and Bakeoff2006, and the CRF++0.53 toolkit is used to perform segmentation by word-position tagging of character sequences. The influence of training corpus scale on segmentation performance is analyzed quantitatively, and a quantified conclusion is obtained on how to select the training corpus scale for CRF-based Chinese word segmentation.
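As a rough illustration of the experimental setup, the Python sketch below converts segmented sentences into per-character word-position tags and cuts a corpus to a given fraction of its sentences. It rests on assumptions not stated in the abstract: the four-tag B/M/E/S scheme, the helper names `tag_words`, `to_crfpp_lines` and `take_fraction`, and the choice to cut subsets by sentence count are all hypothetical. Only the column-per-token layout with a blank line between sentences follows the CRF++ training file format.

```python
from typing import List, Tuple


def tag_words(words: List[str]) -> List[Tuple[str, str]]:
    """Map one segmented sentence to (character, tag) pairs.

    Assumed tag set: B (begin), M (middle), E (end), S (single-character word).
    """
    pairs: List[Tuple[str, str]] = []
    for word in words:
        if not word:
            continue  # skip empty tokens left by repeated separators
        if len(word) == 1:
            pairs.append((word, "S"))
        else:
            pairs.append((word[0], "B"))
            pairs.extend((ch, "M") for ch in word[1:-1])
            pairs.append((word[-1], "E"))
    return pairs


def to_crfpp_lines(sentences: List[List[str]]) -> List[str]:
    """Render sentences as 'char<TAB>tag' lines, with a blank line after each sentence."""
    lines: List[str] = []
    for words in sentences:
        lines.extend(f"{ch}\t{tag}" for ch, tag in tag_words(words))
        lines.append("")  # sentence boundary
    return lines


def take_fraction(sentences: List[List[str]], fraction: float) -> List[List[str]]:
    """Return the leading portion of the corpus (hypothetical way to vary corpus scale)."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return sentences[: round(len(sentences) * fraction)]


if __name__ == "__main__":
    corpus = [["我", "喜欢", "自然语言"], ["分词", "很", "重要"]]
    subset = take_fraction(corpus, 0.5)
    print("\n".join(to_crfpp_lines(subset)))
```

Training one model per subset and scoring each on the same test set would then give one point per corpus scale on a performance curve.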