The Joy of Parallelism with CzEng 1.0

Abstract

CzEng 1.0 is an updated release of our Czech-English parallel corpus, freely available for non-commercial research or educational purposes. In this release, we approximately doubled the corpus size, reaching 15 million sentence pairs (about 200 million tokens per language). More importantly, we carefully filtered the data to reduce the number of non-matching sentence pairs. CzEng 1.0 is automatically aligned at the level of sentences as well as words. We provide not only the plain text representation, but also automatic morphological tags, surface syntactic as well as deep syntactic dependency parse trees, and automatic co-reference links in both English and Czech. This paper describes key properties of the released resource, including the distribution of text domains, the corpus data formats, and a toolkit to handle the provided rich annotation. We also summarize the procedure of the rich annotation (including co-reference resolution) and of the automatic filtering. Finally, we provide some suggestions on exploiting such an automatically annotated sentence-parallel corpus.
