Foreign-language journal: Natural Language Engineering

The Penn Chinese TreeBank: Phrase structure annotation of a large corpus



Abstract

With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts, and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.
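To make "fully segmented, POS-tagged and syntactically bracketed" concrete, the sketch below reads one Penn-style bracketed tree and recovers the word segmentation and the part-of-speech tags from it. It is a minimal illustration only: the sentence and its tree are invented for this example rather than taken from the corpus, the function names `parse_bracketed` and `tagged_words` are hypothetical, and the released files may use additional conventions (such as empty categories or function tags) that this reader does not handle.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]          # phrase label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                     # consume the closing ")"
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected tokens after the tree")
    return tree


def tagged_words(tree):
    """Return (word, POS tag) pairs in sentence order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))       # the parent label is the POS tag
        else:
            pairs.extend(tagged_words(child))
    return pairs


# Invented example sentence, not a corpus excerpt.
EXAMPLE = "(IP (NP (NR 张三)) (VP (VV 喜欢) (NP (NN 音乐))))"
tree = parse_bracketed(EXAMPLE)
pairs = tagged_words(tree)
words = [word for word, _ in pairs]
```

Here `words` carries the segmentation layer, `pairs` the part-of-speech layer, and `tree` the bracketing layer, which is why a single bracketed file can serve segmenters, taggers and parsers alike.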
