
Boosting Bitext Compression



Abstract

Bilingual parallel corpora, also known as bitexts, convey the same information in two different languages. This implies that when modelling bitexts one can take advantage of the relation that exists between both texts; the text alignment task makes it possible to establish this relationship. In this paper we propose different approaches that use words and biwords (pairs made of two words, each one from a different text) as symbolic representation units. The properties of these approaches are analysed from a statistical point of view and tested as a preprocessing step to general-purpose compressors. The results obtained suggest interesting conclusions concerning the use of both words and biwords. When encoded models are used as compression boosters, we achieve compression ratios that improve on state-of-the-art compressors by up to 6.5 percentage points, while being up to 40% faster.
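To make the notion of a biword concrete, the following is a minimal sketch of how aligned words from the two texts might be paired into symbols and mapped to integer identifiers before being handed to a general-purpose compressor. It is not the method of the paper: the function names `build_biwords` and `to_symbol_ids`, the dictionary-based one-to-one alignment, and the use of an empty string for unaligned words are all assumptions made for illustration. The sketch also discards the positions of the right-hand words, which a real lossless scheme would need to store.

```python
from typing import Dict, List, Tuple

Biword = Tuple[str, str]


def build_biwords(left: List[str], right: List[str],
                  alignment: Dict[int, int]) -> List[Biword]:
    """Pair each left word with its aligned right word (hypothetical helper).

    `alignment` maps an index in `left` to an index in `right` and is
    assumed to be one-to-one. Unaligned words are paired with "".
    """
    biwords: List[Biword] = []
    for i, word in enumerate(left):
        j = alignment.get(i)
        biwords.append((word, right[j]) if j is not None else (word, ""))
    # Right-hand words that no left word points to are appended at the end.
    aligned_right = set(alignment.values())
    for j, word in enumerate(right):
        if j not in aligned_right:
            biwords.append(("", word))
    return biwords


def to_symbol_ids(biwords: List[Biword]) -> Tuple[List[int], Dict[Biword, int]]:
    """Replace each distinct biword by a small integer, in order of first use."""
    vocab: Dict[Biword, int] = {}
    ids = [vocab.setdefault(bw, len(vocab)) for bw in biwords]
    return ids, vocab


# Toy English/Spanish example with a hand-written alignment.
left = ["the", "green", "house"]
right = ["la", "casa", "verde"]
alignment = {0: 0, 1: 2, 2: 1}
pairs = build_biwords(left, right, alignment)
ids, vocab = to_symbol_ids(pairs)
```

Because a repeated translation pair maps to the same identifier, the resulting integer sequence exposes cross-language regularities that a downstream compressor can exploit, which is the intuition behind using biwords as a preprocessing step.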
