Statistical Substring Reduction in Linear Time

Abstract

We study the problem of efficiently removing equal-frequency n-gram substrings from an n-gram set, formally called Statistical Substring Reduction (SSR). SSR is a useful operation in corpus-based multi-word unit research and in the new-word identification task of oriental language processing. We present a new SSR algorithm that runs in linear time, O(n), and prove its equivalence to the traditional O(n^2) algorithm. In particular, using experimental results from several corpora of different sizes, we show that it is possible to achieve performance close to that theoretically predicted for this task. Even on a small corpus the new algorithm is several orders of magnitude faster than the O(n^2) one. These results show that our algorithm is reliable and efficient, and is therefore an appropriate choice for large-scale corpus processing.
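
To make the operation concrete, the following is a minimal sketch of the quadratic baseline as the abstract describes it: an n-gram is dropped when some longer n-gram in the set contains it and has the same frequency. It is not the linear-time algorithm of the paper, whose details are not given here. The function names (`is_sub_ngram`, `naive_ssr`), the representation of n-grams as token tuples, and the example counts are all assumptions made for illustration.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]  # assumed representation: an n-gram is a tuple of tokens


def is_sub_ngram(short: NGram, long: NGram) -> bool:
    """Return True if `short` is a contiguous, strictly shorter run inside `long`."""
    n, m = len(short), len(long)
    if n >= m:
        return False
    return any(long[i:i + n] == short for i in range(m - n + 1))


def naive_ssr(freqs: Dict[NGram, int]) -> Dict[NGram, int]:
    """Quadratic SSR sketch: compare every n-gram against every other one."""
    kept: Dict[NGram, int] = {}
    for gram, count in freqs.items():
        # An n-gram is absorbed if a longer n-gram with equal frequency contains it.
        absorbed = any(
            count == other_count and is_sub_ngram(gram, other)
            for other, other_count in freqs.items()
        )
        if not absorbed:
            kept[gram] = count
    return kept


# Hypothetical counts, invented purely for illustration.
example = {
    ("new",): 12,
    ("york",): 7,
    ("new", "york"): 7,
    ("new", "york", "city"): 3,
    ("york", "city"): 3,
    ("city",): 9,
}
reduced = naive_ssr(example)
```

In this made-up set, `("york",)` is absorbed by `("new", "york")` and `("york", "city")` by `("new", "york", "city")`, because each pair shares a frequency; the other four n-grams survive. The nested scan over all pairs is what makes this baseline quadratic in the number of n-grams.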
