Published in: International Conference on Computer Processing of Oriental Languages

A Statistical Approach to Extract Chinese Chunk Candidates from Large Corpora

Abstract

Extracting chunk candidates from real corpora is one of the fundamental tasks in building an example-based machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from large monolingual corpora. The first step is to extract long N-grams (up to 20-grams) from the raw corpus. Two newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to remove unnecessary N-grams using their frequency information. Both algorithms are efficient (each has a time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information is used to obtain chunk candidates from the reduced N-gram set. Perhaps the biggest contribution of this paper is that it is the first to apply the Fast Statistical Substring Reduction algorithms to large corpora and to demonstrate their effectiveness and efficiency, which we hope will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different sizes show that the method can efficiently extract chunk candidates from gigabyte-scale corpora with current computational power. We obtain an extraction accuracy of 86.3% on the People's Daily 2000 news corpus.
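
The three stages described in the abstract (N-gram counting, frequency-based substring reduction, mutual-information filtering) can be illustrated with a small Python sketch. This is not the paper's FSSR implementation, whose details the abstract does not give. It assumes a plain reduction rule (drop an N-gram when a longer N-gram containing it has the same frequency) and an assumed scoring rule (the smallest pointwise mutual information over all two-way splits). The function names, `min_freq`, and `min_pmi` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import log2


def count_ngrams(tokens, max_n):
    """Count every contiguous n-gram of length 1..max_n."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def reduce_substrings(counts, min_freq=2):
    """Drop n-grams contained in a longer n-gram of equal frequency.

    Assumed rule, not the paper's FSSR algorithms. Checking only the two
    substrings that are one token shorter is enough here: if a substring
    has the same count as a longer n-gram, every n-gram in between has
    that count too, so the redundancy is found one step at a time.
    """
    kept = {g: f for g, f in counts.items() if f >= min_freq}
    redundant = set()
    for gram, freq in kept.items():
        if len(gram) < 2:
            continue
        for sub in (gram[:-1], gram[1:]):
            if kept.get(sub) == freq:
                redundant.add(sub)
    return {g: f for g, f in kept.items() if g not in redundant}


def min_split_pmi(gram, counts, total):
    """Smallest pointwise mutual information over all two-way splits.

    Assumed scoring rule; the abstract only says that mutual information
    is used. Probabilities are approximated as count / total tokens.
    """
    p = counts[gram] / total
    return min(
        log2(p / ((counts[gram[:k]] / total) * (counts[gram[k:]] / total)))
        for k in range(1, len(gram))
    )


def chunk_candidates(tokens, max_n=4, min_freq=2, min_pmi=1.0):
    """Run the three stages and return the surviving multi-token n-grams."""
    counts = count_ngrams(tokens, max_n)
    reduced = reduce_substrings(counts, min_freq)
    total = len(tokens)
    return sorted(
        g for g in reduced
        if len(g) > 1 and min_split_pmi(g, counts, total) >= min_pmi
    )
```

For Chinese text, `tokens` could be a list of characters or of segmented words; the abstract does not say which unit the authors use. The nested loops in `count_ngrams` are meant for toy inputs only and make no attempt at the O(n) reduction cost reported for the FSSR algorithms.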
