20th International Conference on Computer Processing of Oriental Languages; Aug 4-6, 2003; Shenyang, China

A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

Abstract

Extracting chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. Both algorithms are efficient, with a time complexity of O(n), and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is the first to apply Fast Statistical Substring Reduction algorithms to large corpora and to demonstrate their effectiveness and efficiency, which we hope will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that the method can efficiently extract chunk candidates from gigabyte-scale corpora with current computational power. We obtain an extraction accuracy of 86.3% on the People Daily 2000 news corpus.
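The abstract names three stages (N-gram extraction, frequency-based substring reduction, mutual-information filtering) but does not give the reduction rule or the exact mutual-information criterion. The Python sketch below is therefore only an illustration under stated assumptions: an N-gram is treated as redundant when an N-gram one character longer that contains it has the same frequency, and a candidate is scored by its lowest pointwise mutual information over all two-way splits. The function names, the thresholds `min_freq` and `min_pmi`, and the toy input are hypothetical, and the naive reduction loop is not the O(n) FSSR algorithm the paper proposes.

```python
import math
from collections import Counter


def count_ngrams(text, max_n=4):
    """Count every character n-gram of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Assumed reduction rule: drop an n-gram when an n-gram one
    character longer that contains it is exactly as frequent."""
    frequent = {g: c for g, c in counts.items()
                if len(g) >= 2 and c >= min_freq}
    redundant = set()
    for gram, freq in frequent.items():
        # The only one-character-shorter substrings are the prefix and suffix.
        for sub in (gram[:-1], gram[1:]):
            if frequent.get(sub) == freq:
                redundant.add(sub)
    return {g: c for g, c in frequent.items() if g not in redundant}


def min_split_pmi(gram, counts, total):
    """Assumed score: lowest pointwise mutual information (base 2)
    over all ways of splitting gram into a left and a right part."""
    p_gram = counts[gram] / total
    return min(
        math.log2(p_gram / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def extract_candidates(text, max_n=4, min_freq=2, min_pmi=1.0):
    """Run the three stages and return the surviving n-grams, sorted."""
    counts = count_ngrams(text, max_n)
    reduced = reduce_substrings(counts, min_freq)
    total = len(text)  # rough normaliser: number of character positions
    return sorted(g for g in reduced
                  if min_split_pmi(g, counts, total) >= min_pmi)


if __name__ == "__main__":
    # "ab" and "bc" occur exactly as often as "abc", so only "abc" survives.
    print(extract_candidates("abcxabcyabcz"))  # ['abc']
```

On the toy string, "ab", "bc" and "abc" each occur three times, so the two bigrams are dropped in favour of the trigram. A real corpus would need word or character segmentation choices, a much larger `max_n` (the abstract mentions up to 20), and a reduction pass that scales linearly, none of which this sketch attempts.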
