Conference: IJCNLP 2011

Extract Chinese Unknown Words from a Large-scale Corpus Using Morphological and Distributional Evidences

Abstract

The representative method of using morphological evidence for Chinese unknown word (UW) extraction is the Chinese word segmentation (CWS) model, and the representative method of using distributional evidence is the accessor variety (AV) criterion. However, neither method has been verified on a large-scale corpus. In this paper, we propose extensions that remedy the drawbacks of these two methods so that they can handle a large-scale corpus: (1) for CWS, we propose a generalized definition of a word to improve recall; and (2) for AV, we propose a restricted version to decrease noise. We carry out experiments on a Chinese Web corpus of approximately 200 billion Chinese characters. Experimental results show that our methods outperform the baselines, and that combining the two kinds of evidence further improves performance. Moreover, our methods can also efficiently segment the corpus on the fly, which is especially valuable for processing large-scale corpora.
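
For readers unfamiliar with the AV criterion mentioned in the abstract, the following is a minimal Python sketch of the commonly cited unrestricted formulation: the AV of a candidate string is the smaller of its number of distinct left-neighbor characters and its number of distinct right-neighbor characters. It is not the restricted version proposed in the paper, whose details the abstract does not give. The function name `accessor_variety`, the `max_len` cap, and the boundary markers are assumptions made for illustration; collapsing every sentence boundary into a single marker per side is a simplification.

```python
from collections import defaultdict


def accessor_variety(sentences, max_len=4):
    """Return {substring: AV} for all substrings up to max_len characters.

    AV(s) = min(#distinct left neighbors, #distinct right neighbors).
    Sentence boundaries are represented by one marker per side
    (an illustrative simplification, not taken from the paper).
    """
    left = defaultdict(set)   # substring -> set of characters seen before it
    right = defaultdict(set)  # substring -> set of characters seen after it
    for s in sentences:
        n = len(s)
        for i in range(n):
            for j in range(i + 1, min(i + max_len, n) + 1):
                sub = s[i:j]
                left[sub].add(s[i - 1] if i > 0 else "<S>")
                right[sub].add(s[j] if j < n else "</S>")
    return {sub: min(len(left[sub]), len(right[sub])) for sub in left}


# Toy usage: a string that appears in varied contexts gets a higher AV
# than its individual characters, which always co-occur with each other.
toy_corpus = ["我喜欢苹果", "他喜欢香蕉", "喜欢你"]
av = accessor_variety(toy_corpus)
print(av["喜欢"], av["喜"], av["欢"])
```

A high AV suggests that a string is used freely in many contexts and is therefore a plausible word candidate; on a real corpus a threshold on AV would be chosen empirically.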
